I’m working on building a redshift database by listening to events from from different sources and pump that data into a redshift cluster.
The idea is to use Kinesis firehose to pump data to redshift using COPY command. But I have a dilemma here: I wish to first query some information from redshift using a select query such as the one below:
select A, B, C from redshift__table where D='x' and E = 'y';
After getting the required information from redshift, I will combine that information with my event notification data and issue a request to kinesis. Kinesis will then do its job and issue the required COPY command.
Now my question is that is it a good idea to repeatedly query redshift like say every second since that is the expected time after which I will get event notifications?
Now let me describe an alternate scenario:
If I normalize my table and separate out some fields into a separate table then, I will have to perform fewer redshift queries with the normalized design (may be once every 30 seconds)
But the downside of this approach is that once I have the data into redshift, I will have to carry out table joins while performing real time analytics on my redshift data.
So I wish to know on a high level which approach would be better:
Have a single flat table but query it before issuing a request to kinesis on an event notification. There wont be any table joins while performing analytics.
Have 2 tables and query redshift less often. But perform a table join while displaying results using BI/analytical tools.
Which of these 2 do you think is a better option? Let us assume that I will use appropriate sort keys/distribution keys in either cases.
Source: Stack Overflow