AWS Redshift
Amazon Redshift is a cloud-based data warehousing solution provided by AWS. It is an OLAP-style database service that runs complex queries across huge datasets. Such services are typically used for business intelligence reporting and decision-making. Organizations with large amounts of transactional data can process it to identify and analyze trends and patterns in consumer behavior.
In a way, Amazon Redshift is a cluster management system designed for managing petabyte-scale data in the cloud. It gives you the flexibility to select the number and size of nodes in the cluster being provisioned to suit your needs. As the amount of data grows over time, it also lets you resize the cluster and move older data to cold storage such as S3.
The Redshift web interface offers various management options, including creating clusters, selecting the type and number of nodes, and creating and restoring snapshots. A Redshift cluster may have one or more compute nodes; certain node types require at least 2 nodes in a cluster.
Out of all the nodes, one acts as the leader node and accepts all the incoming queries. The leader node is responsible for parsing the queries and creating execution plans. Once a plan is ready, the leader node communicates with the compute nodes for parallel execution of the plan. When the compute nodes are ready with their result sets, the leader node consolidates them and responds to the original query.
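This split between leader and compute nodes is visible from the management API. As a minimal sketch using boto3 (the cluster identifier here is just a placeholder), the role of each node in a running cluster can be listed like this:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Hypothetical cluster identifier used for illustration.
cluster = redshift.describe_clusters(ClusterIdentifier="analytics-cluster")["Clusters"][0]

# Each entry reports whether the node acts as the leader or as a compute node.
for node in cluster["ClusterNodes"]:
    print(node["NodeRole"], node.get("PrivateIPAddress"))
```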
In a given Redshift cluster, all the nodes are of the same type. The node type selected determines the vCPU, memory, and storage capacity allocated per node. Node types are grouped as below (a short sketch for listing the options available in a region follows this list):
RA3 - this is the recommended node type for Redshift clusters. It offers high compute capability along with scalable storage. RA3 nodes decouple compute from storage, giving us the flexibility to choose appropriate storage that is not dictated by the compute nodes. The intention is to avoid paying extra for compute when only storage needs to scale, and vice versa.
DC2 - also known as dense compute nodes. These are compute-intensive nodes with local SSD storage for faster query performance. They are a good fit in the early days of a data warehouse setup, when the data is just beginning to grow. During the growth phase, however, compute and storage do not always grow together, and because the storage is local, scaling one means scaling the other.
DS2 - These are legacy node types with local HDD storage.
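The exact node types on offer vary by region. As a rough illustration with boto3 (no cluster required), the node types that can currently be provisioned can be listed as follows:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# List the node types (ra3.*, dc2.*, ds2.*) orderable in this region,
# along with the cluster types (single-node / multi-node) they support.
options = redshift.describe_orderable_cluster_options()["OrderableClusterOptions"]
node_types = sorted({(o["NodeType"], o["ClusterType"]) for o in options})

for node_type, cluster_type in node_types:
    print(f"{node_type:15s} {cluster_type}")
```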
While creating a cluster on Redshift, we provide the database master user credentials along with other optional access permissions. Given the size and scale of a Redshift cluster deployment, it is clearly not suited to smaller databases; it would be absurd to pay the cost of maintaining a 100 GB database on Redshift.
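A minimal cluster-creation sketch with boto3 is shown below. The identifiers, password, and security group ID are hypothetical placeholders; in practice the master password should come from a secrets store rather than being hard-coded.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Provision a small multi-node RA3 cluster inside a VPC security group.
redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",
    ClusterType="multi-node",
    NodeType="ra3.xlplus",
    NumberOfNodes=2,
    DBName="warehouse",
    MasterUsername="admin_user",
    MasterUserPassword="ChangeMe-Str0ngPassw0rd",  # placeholder only
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],
    PubliclyAccessible=False,
)
```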
Redshift also gives us an option to purchase reserved nodes. Reserving a node can reduce costs drastically: if you compare the estimate for an "on-demand" cluster with that of a cluster built from reserved nodes of the same type, the reserved nodes cost at least 40-60% less. So if you are sure you will need the cluster for long enough, it is probably a good idea to purchase reserved nodes.
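Reserved node offerings can be browsed and purchased through the same API. A hedged sketch (the node type and count are arbitrary choices for illustration):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Browse the reserved node offerings for a given node type.
offerings = redshift.describe_reserved_node_offerings()["ReservedNodeOfferings"]
ra3_offerings = [o for o in offerings if o["NodeType"] == "ra3.xlplus"]

for o in ra3_offerings:
    print(o["ReservedNodeOfferingId"], o["Duration"], o["FixedPrice"], o["OfferingType"])

# Purchasing locks in the discounted rate for the offering's duration.
redshift.purchase_reserved_node_offering(
    ReservedNodeOfferingId=ra3_offerings[0]["ReservedNodeOfferingId"],
    NodeCount=2,
)
```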
It is possible to take snapshots, which are point-in-time backups of the cluster. Automated snapshots can be configured, and snapshots are retained for the configured retention period. One has to be careful when deciding on the retention period - since Redshift snapshots are high in volume, keeping them for a long time can have a significant impact on cost.
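The retention period is a cluster-level setting that can be changed after the fact. A small sketch (hypothetical cluster name) that caps automated snapshot retention at 7 days:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Keep automated snapshots for 7 days; older automated snapshots are deleted,
# which keeps snapshot storage (and cost) bounded.
redshift.modify_cluster(
    ClusterIdentifier="analytics-cluster",
    AutomatedSnapshotRetentionPeriod=7,
)
```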
The schedule for automated snapshots can be specified using a simple cron notation. Automated snapshots can be turned off by setting the retention period to 0; doing this also deletes the automated snapshots that are stored and managed on S3. Manual snapshots can be taken at any time, and snapshots created manually have to be deleted manually.
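A sketch of both flavors is below, again with placeholder identifiers. The schedule definition is written in Redshift's snapshot-schedule cron form as I understand it (assumed here to be a three-field minutes/hours/day-of-week expression); the exact expression should be adjusted to your needs.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Automated snapshots roughly every 12 hours, attached to the cluster.
redshift.create_snapshot_schedule(
    ScheduleIdentifier="every-12-hours",
    ScheduleDefinitions=["cron(0 0/12 *)"],  # assumed cron form; verify before use
)
redshift.modify_cluster_snapshot_schedule(
    ClusterIdentifier="analytics-cluster",
    ScheduleIdentifier="every-12-hours",
)

# Manual snapshots are retained until explicitly deleted.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="pre-migration-backup",
    ClusterIdentifier="analytics-cluster",
)
redshift.delete_cluster_snapshot(SnapshotIdentifier="pre-migration-backup")
```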
Once snapshots are created, they can also be copied to another AWS region, and a new Redshift cluster can be created from a snapshot - this is often called restoration. A Redshift snapshot contains the configuration information for the cluster, including the type and number of nodes. It is possible to restore a snapshot into a new configuration, provided the new configuration is capable of storing the existing data.
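A sketch of both operations follows; the regions, cluster names, and the snapshot identifier are placeholders.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Automatically copy this cluster's automated snapshots to another region.
redshift.enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,
)

# "Restoration": create a brand new cluster from a snapshot, optionally with a
# different node configuration, as long as it can hold the existing data.
redshift_west = boto3.client("redshift", region_name="us-west-2")
redshift_west.restore_from_cluster_snapshot(
    ClusterIdentifier="analytics-cluster-dr",
    SnapshotIdentifier="analytics-cluster-snapshot-2024-01-01",  # placeholder
    NodeType="ra3.xlplus",
    NumberOfNodes=4,
)
```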
Redshift clusters can be resized, changing the type and number of nodes the cluster runs on. Resizing works in both directions - scaling up or down, horizontally and vertically - and can be done in two ways: elastic resize and classic resize.
During an elastic resize, a snapshot of the latest available data is taken. A new cluster with the new configuration (new type and number of nodes) is created and the snapshot data is transferred to it. The operation takes a few minutes (15-20), during which the cluster is unavailable for write operations; reads still work. It is important to note the timestamp at which the resize begins, as it helps track down any missed write transactions during the process. Also, since snapshots are incremental, the very first snapshot can take a long time if automated snapshots are disabled, so it is recommended to enable an automated snapshot schedule.
Elastic resizing is not always supported - for example, when transitioning from a single node to multiple nodes, or when the number of nodes grows from x to x+y for certain node types. In these cases, classic resizing is used. Classic resizing does not depend on snapshots: when it is triggered, another cluster with the target configuration is created and the current cluster transitions to read-only mode, so all write requests are dropped. When all the data has been copied, the connection switches over to the new cluster. Depending on the workload, size, data spread, and target configuration, classic resizing can take anywhere from 2 hours to 2 days.
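Both modes go through the same API call, with a flag selecting the behavior. A sketch with placeholder values:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Elastic resize: reshuffles data on the existing cluster, typically minutes.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    NumberOfNodes=4,
    Classic=False,
)

# Classic resize: provisions a new cluster and copies the data across; used when
# the target configuration is not supported by elastic resize.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    ClusterType="multi-node",
    NodeType="ra3.4xlarge",
    NumberOfNodes=2,
    Classic=True,
)
```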
It is always recommended to create a Redshift cluster in a VPC. That way, we can use VPC security groups configured with access control and security in mind, and NACLs at the subnet level provide an additional layer of security that comes with being part of the VPC. If VPCs are not used, security groups can be configured for Redshift clusters independently. Security of a Redshift cluster can be managed at 4 different levels, as below:
Cluster management - the ability to manage clusters through the AWS Management Console, CLI, and APIs. Access to this is managed using IAM policies.
Cluster connectivity - the ability to connect to the cluster from remote clients is managed by security groups. IP CIDR ranges can be specified in the security group rules to allow access from specific IPs.
Database access - the database users who can access the database itself. As with regular databases, users can be created with appropriate access permissions.
Temporary access using SSO - database drivers such as JDBC and ODBC help manage user access to the database. It is possible to grant temporary access to the Redshift database to federated users via a SAML 2.0 compliant IdP.
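For the temporary-access case, short-lived database credentials can be issued from the management API instead of storing a database password. A sketch, with hypothetical user, database, and cluster names:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Issue short-lived credentials for an IAM-authenticated (or federated) identity.
creds = redshift.get_cluster_credentials(
    DbUser="analyst",
    DbName="warehouse",
    ClusterIdentifier="analytics-cluster",
    DurationSeconds=900,
    AutoCreate=True,  # create the database user if it does not exist yet
)

print(creds["DbUser"], creds["Expiration"])
# creds["DbPassword"] is then passed to the JDBC/ODBC driver when connecting.
```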
Redshift also supports encryption of the stored data. This is an immutable property - once the data is stored in encrypted form, it cannot be converted back to non-encrypted form. Snapshots of encrypted clusters are also encrypted, and clusters restored from those snapshots remain encrypted.
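Encryption is typically switched on at creation time. A sketch below enables it with a customer-managed KMS key; the key ARN, names, and password are placeholders.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Create the cluster encrypted from the start; its snapshots are encrypted too.
redshift.create_cluster(
    ClusterIdentifier="secure-cluster",
    ClusterType="multi-node",
    NodeType="ra3.xlplus",
    NumberOfNodes=2,
    MasterUsername="admin_user",
    MasterUserPassword="ChangeMe-Str0ngPassw0rd",  # placeholder only
    Encrypted=True,
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555",
)
```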
Redshift stores access logs, operations performed, transactional logs, and any other activity that happens on the database for monitoring purposes. Redshift also publishes performance data as Amazon CloudWatch metrics.
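As a final sketch, audit logging can be shipped to S3 and a CloudWatch metric can be queried for the cluster; the cluster name and bucket are placeholders.

```python
import boto3
from datetime import datetime, timedelta, timezone

redshift = boto3.client("redshift", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Ship connection and user-activity audit logs to an S3 bucket.
redshift.enable_logging(
    ClusterIdentifier="analytics-cluster",
    BucketName="my-redshift-audit-logs",
    S3KeyPrefix="analytics-cluster/",
)

# Pull a performance metric that Redshift publishes to CloudWatch.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "analytics-cluster"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
print(stats["Datapoints"])
```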