{"id":1063,"date":"2019-07-09T10:50:58","date_gmt":"2019-07-09T10:50:58","guid":{"rendered":"https:\/\/www.testpreptraining.com\/tutorial\/?page_id=1063"},"modified":"2022-03-03T06:56:28","modified_gmt":"2022-03-03T06:56:28","slug":"determine-the-operational-characteristics-of-the-collection-system","status":"publish","type":"page","link":"https:\/\/www.testpreptraining.ai\/tutorial\/aws-certified-big-data-specialty\/determine-the-operational-characteristics-of-the-collection-system\/","title":{"rendered":"Determine the Operational Characteristics of the Collection System"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\"> <strong> <strong>AWS Big Data Exam<\/strong> updated to <a href=\"https:\/\/www.testpreptraining.ai\/aws-certified-data-analytics-specialty-exam\" target=\"_blank\" rel=\"noreferrer noopener\">AWS Certified Data Analytics Specialty.<\/a><\/strong> <\/h2>\n\n\n\n<p>Big Data 4Vs or features<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Volume\n\u2013 It is related to enormous size. <\/li><li>Variety\n\u2013 It is the heterogeneous sources and the nature of data, both structured and\nunstructured. It can be emails, photos, videos, monitoring devices, PDFs,\naudio, etc. for analysis. <\/li><li>Velocity\n\u2013 It is the speed of generation of data or speed at which data flows in from\nsources though the flow of data is massive and continuous.<\/li><li>Variability\n\u2013 It refers to the inconsistency in the data . <\/li><\/ul>\n\n\n\n<p>Characteristics of data collection<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Validity\nis the degree to which the tool measures what it is intended to measure. Weighing\nscale measures body weight and its valid; a tool which is valid for one\nmeasure, need not be valid for another. <\/li><li>Reliability\nindicates the accuracy and consistency of input data.<\/li><li>Sensitivity\nrefers to the capability to detect changes or difference when they to occur. <\/li><li>Objectivity\nmeans freedom from bias. 
<\/li><li>Economy: the cost and resources needed. <\/li><li>Practicability: simplicity of administration, scoring, and interpretation.<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Stream Processing<\/h2>\n\n\n\n<p>Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><\/td><td>\n  Batch processing\n  <\/td><td>\n  Stream processing\n  <\/td><\/tr><tr><td>\n  Data scope\n  <\/td><td>\n  Queries or processing over all or most of the data in\n  the dataset.\n  <\/td><td>\n  Queries or processing over data within a rolling time\n  window, or on just the most recent data record.\n  <\/td><\/tr><tr><td>\n  Data size\n  <\/td><td>\n  Large batches of data.\n  <\/td><td>\n  Individual records or micro batches consisting of a\n  few records.\n  <\/td><\/tr><tr><td>\n  Performance\n  <\/td><td>\n  Latencies in minutes to hours.\n  <\/td><td>\n  Requires latency on the order of seconds or\n  milliseconds.\n  <\/td><\/tr><tr><td>\n  Analyses\n  <\/td><td>\n  Complex analytics.\n  <\/td><td>\n  Simple response functions, aggregates, and rolling\n  metrics.\n  <\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Stream Processing Challenges<\/strong> &#8211; Processing real-time data as it arrives can enable you to make decisions much faster than is possible with traditional data analytics technologies. However, building and operating your own custom streaming data pipelines is complicated and resource intensive. 
You have to build a system\nthat can cost-effectively collect, prepare, and transmit data coming\nsimultaneously from thousands of data sources. You need to fine-tune the\nstorage and compute resources so that data is batched and transmitted\nefficiently for maximum throughput and low latency. You have to deploy and\nmanage a fleet of servers to scale the system so you can handle the varying\nspeeds of data you are going to throw at it. After you have built this\nplatform, you have to monitor the system and recover from any server or network\nfailures by catching up on data processing from the appropriate point in the\nstream, without creating duplicate data. All of this takes valuable time and\nmoney and, at the end of the day, most companies just never get there and must\nsettle for the status-quo and operate their business with information that is\nhours or days old.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">AWS Services for Collection of Different Data Types<\/h2>\n\n\n\n<p>Real Time &#8211; Immediate actions<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Kinesis Data Streams (KDS)<\/li><li>Simple Queue Service (SQS)<\/li><li>Internet of Things (IoT)<\/li><\/ul>\n\n\n\n<p>Near-real time &#8211; Reactive actions<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Kinesis Data Firehose (KDF)<\/li><li>Database Migration Service (DMS)<\/li><\/ul>\n\n\n\n<p>Batch &#8211; Historical Analysis<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Snowball<\/li><li>Data Pipeline<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Amazon Kinesis<\/h2>\n\n\n\n<p>Use Amazon Kinesis Data Streams to collect and process\nlarge streams of data records in real time. You can create data-processing\napplications, known as Kinesis Data Streams applications. A typical Kinesis\nData Streams application reads data from a data stream as data records.<\/p>\n\n\n\n<p>You can use Kinesis Data Streams for rapid and continuous\ndata intake and aggregation. 
The type of data used can include IT\ninfrastructure log data, application logs, social media, market data feeds, and\nweb clickstream data. Because the response time for the data intake and\nprocessing is in real time, the processing is typically lightweight. <\/p>\n\n\n\n<p>The following diagram illustrates the high-level architecture of Kinesis Data Streams. The producers continually push data to Kinesis Data Streams, and the consumers process the data in real time. Consumers (such as a custom application running on Amazon EC2 or an Amazon Kinesis Data Firehose delivery stream) can store their results using an AWS service such as Amazon DynamoDB, Amazon Redshift, or Amazon S3. <\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"750\" height=\"323\" src=\"https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system-750x323.png\" alt=\"determine the operational characteristics of the collection system\n\" class=\"wp-image-1175\" srcset=\"https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system-750x323.png 750w, https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system.png 1102w\" sizes=\"auto, (max-width: 750px) 100vw, 750px\" \/><\/figure><\/div>\n\n\n\n<p><strong>Benefits<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Real-time &#8211; Amazon Kinesis enables you to ingest,\nbuffer, and process streaming data in real-time, so you can derive insights in\nseconds or minutes instead of hours or days.<\/li><li>Fully managed &#8211; Amazon Kinesis is fully managed\nand runs your streaming applications without requiring you to manage any infrastructure.<\/li><li>Scalable &#8211; Amazon Kinesis can handle any amount\nof streaming data and process 
data from hundreds of thousands of sources with\nvery low latencies.<\/li><\/ul>\n\n\n\n<p><strong>Kinesis Data\nStreams Terminology<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Kinesis Data Stream &#8211; A Kinesis data stream is a\nset of shards. Each shard has a sequence of data records. Each data record has\na sequence number that is assigned by Kinesis Data Streams.<\/li><li>Data Record &#8211; A data record is the unit of data\nstored in a Kinesis data stream. Data records are composed of a sequence\nnumber, a partition key, and a data blob, which is an immutable sequence of\nbytes. Kinesis Data Streams does not inspect, interpret, or change the data in\nthe blob in any way. A data blob can be up to 1 MB.<\/li><li>Retention Period &#8211; The retention period is the\nlength of time that data records are accessible after they are added to the\nstream. A stream\u2019s retention period is set to a default of 24 hours after\ncreation. You can increase the retention period up to 168 hours (7 days) using\nthe IncreaseStreamRetentionPeriod operation, and decrease the retention period\ndown to a minimum of 24 hours using the DecreaseStreamRetentionPeriod\noperation. Additional charges apply for streams with a retention period set to\nmore than 24 hours. <\/li><li>Producer &#8211; Producers put records into Amazon Kinesis\nData Streams. For example, a web server sending log data to a stream is a\nproducer.<\/li><li>Consumer &#8211; Consumers get records from Amazon\nKinesis Data Streams and process them. These consumers are known as Amazon\nKinesis Data Streams Application.<\/li><li>Amazon Kinesis Data Streams Application &#8211; An\nAmazon Kinesis Data Streams application is a consumer of a stream that commonly\nruns on a fleet of EC2 instances. There are two types of consumers that you can\ndevelop: shared fan-out consumers and enhanced fan-out consumers. <\/li><li>Shard &#8211; A shard is a uniquely identified\nsequence of data records in a stream. 
A stream is composed of one or more\nshards, each of which provides a fixed unit of capacity. Each shard can support\nup to 5 transactions per second for reads, up to a maximum total data read rate\nof 2 MB per second and up to 1,000 records per second for writes, up to a\nmaximum total data write rate of 1 MB per second (including partition keys).\nThe data capacity of your stream is a function of the number of shards that you\nspecify for the stream. The total capacity of the stream is the sum of the\ncapacities of its shards.<\/li><li>Partition Key &#8211; A partition key is used to group\ndata by shard within a stream. Kinesis Data Streams segregates the data records\nbelonging to a stream into multiple shards. It uses the partition key that is\nassociated with each data record to determine which shard a given data record\nbelongs to. Partition keys are Unicode strings with a maximum length limit of\n256 bytes. An MD5 hash function is used to map partition keys to 128-bit\ninteger values and to map associated data records to shards. When an\napplication puts data into a stream, it must specify a partition key.<\/li><li>Sequence Number &#8211; Each data record has a\nsequence number that is unique per partition-key within its shard. Kinesis Data\nStreams assigns the sequence number after you write to the stream with\nclient.putRecords or client.putRecord. Sequence numbers for the same partition\nkey generally increase over time. The longer the time period between write requests,\nthe larger the sequence numbers become.<\/li><li>Kinesis Client Library &#8211; The Kinesis Client\nLibrary is compiled into your application to enable fault-tolerant consumption\nof data from the stream. The Kinesis Client Library ensures that for every\nshard there is a record processor running and processing that shard. The\nlibrary also simplifies reading data from the stream. The Kinesis Client\nLibrary uses an Amazon DynamoDB table to store control data. 
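The partition-key mechanics described above can be sketched in a few lines of Python: MD5-hash the key to a 128-bit integer, then find the shard whose hash key range contains it. This is a minimal sketch, assuming shards split the hash space into equal, contiguous ranges (real streams can have uneven ranges after splits and merges):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index: MD5-hash the key to a
    128-bit integer, then pick the shard whose (assumed equal-sized,
    contiguous) hash key range contains that integer."""
    hash_int = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    range_size = 2 ** 128 // num_shards
    # min() guards the edge case where the top of the hash space
    # would otherwise overflow into a nonexistent shard index.
    return min(hash_int // range_size, num_shards - 1)

# Records sharing a partition key always map to the same shard,
# which is why all data for one key is processed by one worker.
assert shard_for_key("web-server-01", 4) == shard_for_key("web-server-01", 4)
```

Because MD5 spreads keys roughly uniformly over the 128-bit space, using many distinct partition keys distributes records evenly across shards, as the text notes.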
It creates one table per application that is processing data.<\/li><\/ul>\n\n\n\n<p><strong>AWS Kinesis Summary<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>The unit of data stored by Kinesis Data Streams is a data record. A stream represents a group of data records. The data records in a stream are distributed into shards.<\/li><li>A shard has a sequence of data records in a stream. When you create a stream, you specify the number of shards for the stream. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second. Shards also support up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The total capacity of a stream is the sum of the capacities of its shards. You can increase or decrease the number of shards in a stream as needed. However, billing is per provisioned shard, so you can have as many shards as you want but pay for each one. Records are ordered per shard.<\/li><li>A data record is the unit of data of the Kinesis data stream. It is composed of a sequence number (the unique identifier of the record within its shard; type: String), a partition key (identifies which shard in the stream the data record is assigned to; type: String), and a data blob (the data in the blob is both opaque and immutable to Kinesis Data Streams, which does not inspect, interpret, or change the data in the blob in any way; when the data blob, the payload before base64-encoding, is added to the partition key size, the total size must not exceed the maximum record size of 1 MB; type: Base64-encoded binary data object).<\/li><li>If you have sensitive data, you can enable server-side data encryption when you use Amazon Kinesis Data Firehose. However, this is only possible if you use a Kinesis stream as your data source. 
When you configure a Kinesis stream as the data source of a Kinesis Data Firehose delivery stream, Kinesis Data Firehose no longer stores the data at rest. Instead, the data is stored in the Kinesis stream.<\/li><li>When you send data from your data producers to your Kinesis stream, the Kinesis Data Streams service encrypts your data using an AWS KMS key before storing it at rest. When your Kinesis Data Firehose delivery stream reads the data from your Kinesis stream, the Kinesis Data Streams service first decrypts the data and then sends it to Kinesis Data Firehose. Kinesis Data Firehose buffers the data in memory based on the buffering hints that you specify and then delivers it to your destinations without storing the unencrypted data at rest.<\/li><li>In Kinesis , to prevent skipped records, handle all exceptions within processRecords appropriately.<\/li><li>For each Amazon Kinesis Data Streams application, the KCL uses a unique Amazon DynamoDB table to keep track of the application&#8217;s state. Because the KCL uses the name of the Amazon Kinesis Data Streams application to create the name of the table, each application name must be unique.<\/li><li>If your Amazon Kinesis Data Streams application receives provisioned-throughput exceptions, you should increase the provisioned throughput for the DynamoDB table. The KCL creates the table with a provisioned throughput of 10 reads per second and 10 writes per second, but this might not be sufficient for your application. For example, if your Amazon Kinesis Data Streams application does frequent checkpointing or operates on a stream that is composed of many shards, you might need more throughput.<\/li><li>PutRecord returns the shard ID of where the data record was placed and the sequence number that was assigned to the data record. Sequence numbers increase over time and are specific to a shard within a stream, not across all shards within a stream. 
To guarantee strictly increasing ordering, write serially to a shard and use the SequenceNumberForOrdering parameter.<\/li><li>For live streaming, Kinesis is ruled out if the record size is greater than 1 MB; in that case, Kafka can support bigger records.<\/li><li>You can trigger one Lambda function per shard. If you want to use Lambda with Kinesis Streams, you need to create Lambda functions to automatically read batches of records off your Amazon Kinesis stream and process them if records are detected on the stream. AWS Lambda then polls the stream periodically (once per second) for new records.<\/li><li>In Kinesis Data Firehose, the PutRecordBatch() operation can take up to 500 records per call or 4 MB per call, whichever is smaller. Buffer size ranges from 1 MB to 128 MB.<\/li><li>In circumstances where data delivery to the destination is falling behind data ingestion into the delivery stream, Amazon Kinesis Firehose raises the buffer size automatically to catch up and make sure that all data is delivered to the destination.<\/li><li>If data delivery from Kinesis Firehose to Redshift fails, Amazon Kinesis Firehose retries data delivery every 5 minutes for up to a maximum period of 60 minutes. After 60 minutes, Amazon Kinesis Firehose skips the current batch of S3 objects that are ready for COPY and moves on to the next batch. The information about the skipped objects is delivered to your S3 bucket as a manifest file in the errors folder, which you can use for manual backfill. For information about how to COPY data manually with manifest files, see Using a Manifest to Specify Data Files.<\/li><li>If data delivery to your Amazon S3 bucket fails, Amazon Kinesis Firehose retries delivery every 5 seconds for up to a maximum period of 24 hours. If the issue continues beyond the 24-hour maximum retention period, it discards the data.<\/li><li>Aggregation refers to the storage of multiple records in a Streams record. 
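Collection, described next, means batching several stream records into one PutRecords call, which accepts at most 500 records per request. A minimal sketch of the batching step, using the record shape boto3's put_records expects; the stream name "my-stream" is a hypothetical, and the API call itself is shown only in a comment:

```python
def chunk_for_put_records(records, max_batch=500):
    """Split a record list into PutRecords-sized batches; the API
    accepts at most 500 records per call. Each record dict carries
    the keys boto3's put_records expects: 'Data' and 'PartitionKey'."""
    return [records[i:i + max_batch] for i in range(0, len(records), max_batch)]

records = [{"Data": b"event", "PartitionKey": str(i % 8)} for i in range(1200)]
batches = chunk_for_put_records(records)
# 1200 records split into batches of 500, 500, and 200; each batch
# could then be sent with:
#   kinesis.put_records(StreamName="my-stream", Records=batch)
assert [len(b) for b in batches] == [500, 500, 200]
```

Spreading records across 8 partition keys, as here, lets Kinesis distribute them over up to 8 shards.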
Aggregation allows customers to increase the number of records sent per API call, which effectively increases producer throughput. Aggregation means storing multiple user records within a single Kinesis Data Streams record, while collection means using the API operation PutRecords to send multiple Kinesis Data Streams records to one or more shards in your Kinesis data stream. You can first aggregate user records into stream records and then send them to the stream across multiple shards using collection (PutRecords).<\/li><li>Spark Streaming uses the Kinesis Client Library (KCL) to consume data from a Kinesis stream. KCL handles complex tasks like load balancing, failure recovery, and checkpointing.<\/li><li>Amazon Kinesis Data Streams has the following stream and shard limits.<ul><li>There is no upper limit on the number of shards you can have in a stream or account. It is common for a workload to have thousands of shards in a single stream.<\/li><li>There is no upper limit on the number of streams you can have in an account.<\/li><li>A single shard can ingest up to 1 MiB of data per second (including partition keys) or 1,000 records per second for writes. Similarly, if you scale your stream to 5,000 shards, the stream can ingest up to 5 GiB per second or 5 million records per second. If you need more ingest capacity, you can easily scale up the number of shards in the stream using the AWS Management Console or the UpdateShardCount API.<\/li><li>The default shard limit is 500 shards for the following AWS Regions: US East (N. Virginia), US West (Oregon), and EU (Ireland). For all other Regions, the default shard limit is 200 shards.<\/li><li>The maximum size of the data payload of a record before base64-encoding is up to 1 MiB.<\/li><li>GetRecords can retrieve up to 10 MiB of data per call from a single shard, and up to 10,000 records per call. Each call to GetRecords is counted as one read transaction.<\/li><li>Each shard can support up to five read transactions per second. 
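The per-shard limits quoted in this section compose linearly, which is how the 5,000-shard example arrives at roughly 5 GiB per second. A quick arithmetic sketch of that scaling:

```python
def stream_limits(num_shards: int) -> dict:
    """Aggregate stream limits from the per-shard figures quoted in
    the text: writes of 1 MiB/s or 1,000 records/s per shard, and
    reads of 2 MiB/s or 5 GetRecords transactions/s per shard."""
    return {
        "write_mib_per_s": 1 * num_shards,
        "write_records_per_s": 1_000 * num_shards,
        "read_mib_per_s": 2 * num_shards,
        "read_tx_per_s": 5 * num_shards,
    }

# The 5,000-shard stream from the text: 5 million records/s in, and
# 5,000 MiB/s of write throughput (about 5 GiB/s, as quoted above).
limits = stream_limits(5_000)
assert limits["write_records_per_s"] == 5_000_000
```

Because the totals are just sums of per-shard capacities, resharding with UpdateShardCount changes every figure proportionally.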
Each read transaction can provide up to 10,000 records with an upper limit of 10 MiB per transaction.<\/li><li>Each shard can support up to a maximum total data read rate of 2 MiB per second via GetRecords. If a call to GetRecords returns 10 MiB, subsequent calls made within the next 5 seconds throw an exception.<\/li><\/ul><\/li><\/ul>\n\n\n\n<p><strong>Creating a\nStream in Amazon Kinesis<\/strong><\/p>\n\n\n\n<p>You can create a stream using the Kinesis Data Streams\nconsole, the Kinesis Data Streams API, or the AWS Command Line Interface (AWS\nCLI).<\/p>\n\n\n\n<p>To create a data stream using the console<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Sign in to the AWS Management Console and open the Kinesis console at https:\/\/console.aws.amazon.com\/kinesis.<\/li><li>In the navigation bar, expand the Region selector and choose a Region.<\/li><li>Choose Create data stream.<\/li><li>On the Create Kinesis stream page, enter a name for your stream and the number of shards you need, and then click Create Kinesis stream.<\/li><li>On the Kinesis streams page, your stream&#8217;s Status is Creating while the stream is being created. When the stream is ready to use, the Status changes to Active.<\/li><li>Choose the name of your stream. The Stream Details page displays a summary of your stream configuration, along with monitoring information.<\/li><\/ul>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"382\" height=\"180\" src=\"https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system-01.png\" alt=\"determine the operational characteristics of the collection system\n\" class=\"wp-image-1176\"\/><\/figure><\/div>\n\n\n\n<p><strong>Kinesis Data\nStreams Producers<\/strong><\/p>\n\n\n\n<p>A producer puts data records into Amazon Kinesis data\nstreams. 
For example, a web server sending log data to a Kinesis data stream is\na producer. A consumer processes the data records from a stream. <\/p>\n\n\n\n<p>To put data into the stream, you must specify the name\nof the stream, a partition key, and the data blob to be added to the stream.\nThe partition key is used to determine which shard in the stream the data\nrecord is added to. <\/p>\n\n\n\n<p>All the data in the shard is sent to the same worker\nthat is processing the shard. Which partition key you use depends on your\napplication logic. The number of partition keys should typically be much\ngreater than the number of shards. This is because the partition key is used to\ndetermine how to map a data record to a particular shard. If you have enough\npartition keys, the data can be evenly distributed across the shards in a\nstream.<\/p>\n\n\n\n<p>Using KPL &#8211; The KPL is an easy-to-use, highly configurable\nlibrary that helps you write to a Kinesis data stream. It acts as an\nintermediary between your producer application code and the Kinesis Data\nStreams API actions. The KPL performs the following primary tasks:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Writes\nto one or more Kinesis data streams with an automatic and configurable retry\nmechanism<\/li><li>Collects\nrecords and uses PutRecords to write multiple records to multiple shards per\nrequest<\/li><li>Aggregates\nuser records to increase payload size and improve throughput<\/li><li>Integrates\nseamlessly with the Kinesis Client Library (KCL) to de-aggregate batched\nrecords on the consumer<\/li><li>Submits\nAmazon CloudWatch metrics on your behalf to provide visibility into producer\nperformance<\/li><\/ul>\n\n\n\n<p>Using the Amazon Kinesis Data Streams API &#8211; You\ncan develop producers using the Amazon Kinesis Data Streams API with the AWS\nSDK for Java. Once a stream is created, you can add data to it in the form of\nrecords. 
A record is a data structure that contains the data to be processed in\nthe form of a data blob. After you store the data in the record, Kinesis Data\nStreams does not inspect, interpret, or change the data in any way. Each record\nalso has an associated sequence number and partition key. There are two\ndifferent operations in the Kinesis Data Streams API that add data to a stream,\nPutRecords and PutRecord. The PutRecords operation sends multiple records to\nyour stream per HTTP request, and the singular PutRecord operation sends\nrecords to your stream one at a time (a separate HTTP request is required for\neach record). You should prefer using PutRecords for most applications because\nit will achieve higher throughput per data producer.<\/p>\n\n\n\n<p>Using Kinesis Agent &#8211; Kinesis Agent is a\nstand-alone Java software application that offers an easy way to collect and\nsend data to Kinesis Data Streams. The agent continuously monitors a set of\nfiles and sends new data to your stream. The agent handles file rotation,\ncheckpointing, and retry upon failures. It delivers all of your data in a\nreliable, timely, and simple manner. It also emits Amazon CloudWatch metrics to\nhelp you better monitor and troubleshoot the streaming process. By default,\nrecords are parsed from each file based on the newline (&#8216;\\n&#8217;) character. Your\noperating system must be either Amazon Linux AMI with version 2015.09 or later,\nor Red Hat Enterprise Linux version 7 or later.<\/p>\n\n\n\n<p>Using Consumers with Enhanced Fan-Out\n&#8211; In Amazon Kinesis Data Streams, you can build consumers that use a feature\ncalled enhanced fan-out. This feature enables consumers to receive records from\na stream with throughput of up to 2 MiB of data per second per shard. This\nthroughput is dedicated, which means that consumers that use enhanced fan-out\ndon&#8217;t have to contend with other consumers that are receiving data from the\nstream. 
Kinesis Data Streams pushes data records from the stream to consumers\nthat use enhanced fan-out. Therefore, these consumers don&#8217;t need to poll for\ndata. You can register up to five consumers per stream to use enhanced fan-out.\nIf you need to register more than five consumers, you can request a limit\nincrease<\/p>\n\n\n\n<p>Splitting a Shard &#8211; To split a shard in Amazon Kinesis\nData Streams, you need to specify how hash key values from the parent shard\nshould be redistributed to the child shards. When you add a data record to a\nstream, it is assigned to a shard based on a hash key value. The hash key value\nis the MD5 hash of the partition key that you specify for the data record at\nthe time that you add the data record to the stream. Data records that have the\nsame partition key also have the same hash key value. <\/p>\n\n\n\n<p>Merging Two Shards &#8211; A shard merge operation takes two\nspecified shards and combines them into a single shard. After the merge, the\nsingle child shard receives data for all hash key values covered by the two parent\nshards. To merge two shards, the shards must be adjacent. Two shards are\nconsidered adjacent if the union of the hash key ranges for the two shards\nforms a contiguous set with no gaps. For example, suppose that you have two\nshards, one with a hash key range of 276&#8230;381 and the other with a hash key\nrange of 382&#8230;454. You could merge these two shards into a single shard that\nwould have a hash key range of 276&#8230;454.<\/p>\n\n\n\n<p><strong>Kinesis Data\nStreams Consumers<\/strong><\/p>\n\n\n\n<p>A consumer, known as an Amazon Kinesis Data Streams application,\nis an application that you build to read and process data records from Kinesis\ndata streams. 
If you want to send stream records directly to services such as\nAmazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon\nElasticsearch Service (Amazon ES), or Splunk, you can use a Kinesis Data\nFirehose delivery stream instead of creating a consumer application.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">AWS Kinesis Data Firehose<\/h2>\n\n\n\n<p>Amazon Kinesis Data Firehose is a fully managed service\nfor delivering real-time streaming data to destinations such as Amazon Simple\nStorage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service\n(Amazon ES), and Splunk. Kinesis Data Firehose is part of the Kinesis streaming\ndata platform, along with Kinesis Data Streams, Kinesis Video Streams, and Amazon\nKinesis Data Analytics. With Kinesis Data Firehose, you don&#8217;t need to write\napplications or manage resources. You configure your data producers to send\ndata to Kinesis Data Firehose, and it automatically delivers the data to the\ndestination that you specified. <\/p>\n\n\n\n<p><strong>Terminology <\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>record\n&#8211; The data of interest that your data producer sends to a Kinesis Data Firehose\ndelivery stream. A record can be as large as 1,000 KB.<\/li><li>data\nproducer &#8211; Producers send records to Kinesis Data Firehose delivery streams.\nFor example, a web server that sends log data to a delivery stream is a data\nproducer. You can also configure your Kinesis Data Firehose delivery stream to\nautomatically read data from an existing Kinesis data stream, and load it into destinations.<\/li><li>buffer\nsize and buffer interval &#8211; Kinesis Data Firehose buffers incoming streaming\ndata to a certain size or for a certain period of time before delivering it to\ndestinations. Buffer Size is in MBs and Buffer Interval is in seconds.<\/li><\/ul>\n\n\n\n<p><strong>Data Flow<\/strong><\/p>\n\n\n\n<p>For Amazon S3 destinations, streaming data is delivered\nto your S3 bucket. 
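The buffer size and buffer interval terminology above maps directly onto a configuration block when a delivery stream is created. A minimal sketch, assuming the parameter names used by the boto3 Firehose client (the surrounding create_delivery_stream call is only indicated in a comment, not executed):

```python
def buffering_hints(size_in_mb: int, interval_in_seconds: int) -> dict:
    """Build a BufferingHints block for a Firehose S3 destination.
    Firehose flushes to the destination when either threshold is hit
    first; sizes outside the 1-128 MB range quoted above are rejected."""
    if not 1 <= size_in_mb <= 128:
        raise ValueError("SizeInMBs must be between 1 and 128")
    return {"SizeInMBs": size_in_mb, "IntervalInSeconds": interval_in_seconds}

# Flush every 5 MB or every 300 seconds, whichever comes first; this
# dict would sit inside the S3 destination configuration passed to
# firehose.create_delivery_stream(...).
hints = buffering_hints(5, 300)
```

Larger buffers mean fewer, bigger S3 objects (better for batch analytics); smaller buffers reduce delivery latency.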
If data transformation is enabled, you can optionally back\nup source data to another Amazon S3 bucket.<\/p>\n\n\n\n<p>Amazon Kinesis Data Firehose data flow for Amazon S3<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"629\" height=\"247\" src=\"https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system-02.png\" alt=\"determine the operational characteristics of the collection system\n\" class=\"wp-image-1177\"\/><\/figure>\n\n\n\n<p>For Amazon Redshift destinations, streaming data is\ndelivered to your S3 bucket first. Kinesis Data Firehose then issues an Amazon\nRedshift COPY command to load data from your S3 bucket to your Amazon Redshift\ncluster. If data transformation is enabled, you can optionally back up source\ndata to another Amazon S3 bucket.<\/p>\n\n\n\n<p>Amazon Kinesis Data Firehose data flow for Amazon Redshift<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"750\" height=\"261\" src=\"https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system-03-750x261.png\" alt=\"determine the operational characteristics of the collection system\n\" class=\"wp-image-1178\" srcset=\"https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system-03-750x261.png 750w, https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system-03.png 785w\" sizes=\"auto, (max-width: 750px) 100vw, 750px\" \/><\/figure><\/div>\n\n\n\n<p>For Amazon ES destinations, streaming data is delivered\nto your Amazon ES cluster, and it can optionally be backed up to your S3 bucket\nconcurrently.<\/p>\n\n\n\n<p>Amazon Kinesis Data Firehose data flow for 
Amazon ES<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"554\" height=\"284\" src=\"https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system-04.png\" alt=\"determine the operational characteristics of the collection system\n\" class=\"wp-image-1179\"\/><\/figure><\/div>\n\n\n\n<p>For Splunk destinations, streaming data is delivered to\nSplunk, and it can optionally be backed up to your S3 bucket concurrently.<\/p>\n\n\n\n<p>Amazon Kinesis Data Firehose data flow for Splunk<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"652\" height=\"276\" src=\"https:\/\/www.testpreptraining.ai\/tutorial\/wp-content\/uploads\/2019\/07\/determine-the-operational-characteristics-of-the-collection-system-05.png\" alt=\"determine the operational characteristics of the collection system\n\" class=\"wp-image-1180\"\/><\/figure><\/div>\n\n\n\n<p>\nLink for free practice test &#8211; <a href=\"https:\/\/www.testpreptraining.ai\/aws-certified-big-data-specialty-free-practice-test\">https:\/\/www.testpreptraining.ai\/aws-certified-big-data-specialty-free-practice-test<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>AWS Big Data Exam updated to AWS Certified Data Analytics Specialty. Big Data 4Vs or features Volume \u2013 It is related to enormous size. Variety \u2013 It is the heterogeneous sources and the nature of data, both structured and unstructured. It can be emails, photos, videos, monitoring devices, PDFs, audio, etc. for analysis. 
Velocity \u2013&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":1031,"menu_order":2,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[],"class_list":["post-1063","page","type-page","status-publish","hentry","category-amazon-aws"],"acf":[]}