Amazon QuickSight is a powerful application for real-time business intelligence.
What’s more? It charges you ONLY for what you use.
You can either go back to the Amazon Kinesis data stream you created before, or create a brand new one if you want to experiment with both methods.
For our purposes, let’s create a brand new Amazon Kinesis data stream.
Once you create a Kinesis data stream, you will see both producers and consumers in your data stream like below.
Next, let’s create a consumer that’s an Amazon Kinesis Data Firehose (the button that says “Process with delivery stream”).
So, effectively, you are taking the stream of data that your Amazon Connect contact center produces and storing it for further analysis. (Btw, do note that Amazon Connect produces multiple logs, namely agent logs, contact flow logs, contact trace record (CTR) logs, etc.) Your analysis could be real time or historical – each has to be treated differently.
What you want to do is flatten the JSON files that the Amazon Connect CTR records come in.
In this case, you can choose not to “Transform source records with AWS Lambda“. If so, you are sending the raw data stream logs straight to Amazon S3.
If you choose not to transform source records with AWS Lambda at the Firehose level itself, you can flatten those files once they hit the S3 bucket. You will have to do this with an AWS Lambda function that is triggered when a file lands in the Amazon S3 bucket.
For the moment, let’s send these CTR log files to Amazon S3.
In your Amazon Kinesis Firehose delivery stream, make sure that you select the right Kinesis data stream for contact trace records.
That’s the same one you are going to use in your Amazon Kinesis Firehose delivery stream.
This will log to a bucket of your choice. You can choose prefixes – e.g. fh_logs/ and fh_error_logs/ for your logs and error logs. The slash is needed if those are going to be directories. If you want to simply prepend a prefix you can choose fh_logs_ and fh_error_logs_ OR you can leave this entirely blank.
Place some test calls and answer your own calls to test this out.
Confirm that you do receive log files in the bucket you created. E.g.
You can quickly query any of those files with Amazon S3 Select to see the contents of the file.
Thereafter, choose JSON as both the input and output format.
Note that this is not a flat format – the data is still nested, structured JSON.
So, your next step is to flatten these files. To do this, you will create a Lambda function and associate it with a trigger: when a file is added to your S3 bucket via a PUT operation, the function will flatten the file and write it back to the same Amazon S3 bucket.
You can use any language you are comfortable with.
All this does is take the CTR records from the nested directory structure that Amazon Kinesis creates within Amazon S3 and copy them into the flat flatfiles directory. That’s about it.
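As a rough illustration, here is what such a Lambda function might look like in Python. This is a minimal sketch, not the exact function from this setup: the flatfiles/ prefix matches the directory described above, but the helper and handler names are our own, and boto3 is imported lazily so the key-flattening helper can be exercised without the AWS SDK.

```python
def flatten_key(key):
    """Map a nested Firehose key like '2021/03/15/09/ctr-...' to
    a flat key under the flatfiles/ directory."""
    return "flatfiles/" + key.split("/")[-1]

def lambda_handler(event, context):
    # boto3 is imported lazily so flatten_key stays testable
    # without the AWS SDK installed.
    import boto3
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if key.startswith("flatfiles/"):
            continue  # avoid re-processing our own output
        # copy the object into the flat directory in the same bucket
        s3.copy_object(
            Bucket=bucket,
            Key=flatten_key(key),
            CopySource={"Bucket": bucket, "Key": key},
        )
```

One design note: because this Lambda writes back to the same bucket it is triggered on, you need to make sure it does not trigger itself endlessly – either scope the S3 trigger to exclude the flatfiles/ prefix, or guard inside the handler as shown above.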
This makes it much easier for AWS Glue to crawl the flatfiles directory and make any schema discoveries and schema changes / updates if and when they occur.
Whoa, wait – AWS Glue? What for?
Simple – we are going to use AWS Glue to discover a schema from the giant JSON documents that the Amazon Connect CTR records come in. Schema changes can and do occur – that’s why this is not a one-time thing. Keep in mind that Amazon Connect CTR attributes allow you to add multiple custom fields based on whatever contact flow you have designed. So, when this happens, you need to capture this information and flatten it out.
These CTR attributes are just key-value pairs, so instead of traversing a giant list of key-value pairs yourself, you can use a Glue crawler to discover the updated schema and update your database table with the new schema as well.
That’s why we use AWS Glue here.
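To make this concrete, a CTR record carries its custom contact flow data as a key-value map under an Attributes field. Here is a hypothetical sketch of what “flattening” those pairs into columns means – the record fields and attribute names below are made up for illustration:

```python
def flatten_attributes(ctr_record, prefix="attributes_"):
    """Pull the nested key-value Attributes map up into
    top-level columns, prefixed to avoid name collisions."""
    flat = {k: v for k, v in ctr_record.items() if k != "Attributes"}
    for name, value in ctr_record.get("Attributes", {}).items():
        flat[prefix + name.lower()] = value
    return flat

# A trimmed-down, made-up CTR record:
ctr = {
    "ContactId": "abc-123",
    "Channel": "VOICE",
    "Attributes": {"CustomerTier": "gold", "CallReason": "billing"},
}
print(flatten_attributes(ctr))
# {'ContactId': 'abc-123', 'Channel': 'VOICE',
#  'attributes_customertier': 'gold', 'attributes_callreason': 'billing'}
```

Because new attributes can appear whenever someone edits a contact flow, the set of resulting columns changes over time – which is exactly why you let the Glue crawler re-discover the schema instead of hard-coding it.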
This doesn’t mean that you run the AWS Glue crawler all the time. You can trigger the crawler via events. So, based on CTR attribute change events, you trigger the AWS Glue crawler, discover the updated schema, and update the database tables. That’s it.
So, back to the Lambda function you created just now. The next step is to make sure that it triggers on the appropriate event – i.e. when a file lands in the CTR bucket you created in S3.
Once you have done this, make some calls to and from your number. You will see CTR records being created in the S3 bucket you created above. You will also see an additional folder, and files inside that folder.
Your next step is to create an AWS Glue Crawler. A crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your data catalog.
Go ahead and create an AWS Glue crawler.
See above for the configuration that you would need to create.
Keep in mind that your Glue crawler really does not need to crawl ALL the files for eternity. It just needs to crawl the latest files to find schema changes, so make sure you set a lifecycle policy on your flatfiles folder to something short – e.g. 1 or 2 days (there are LOTS of files, albeit small ones).
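For reference, such a lifecycle rule might look like the following – a sketch, where the rule ID is arbitrary and flatfiles/ is the prefix used above. You can set this up in the S3 console or apply it via put-bucket-lifecycle-configuration:

```json
{
  "Rules": [
    {
      "ID": "expire-flatfiles-quickly",
      "Filter": { "Prefix": "flatfiles/" },
      "Status": "Enabled",
      "Expiration": { "Days": 2 }
    }
  ]
}
```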
Run the crawler after creating it. If all goes well, the table will show up in AWS Glue (and you can view the data / contents using Amazon Athena).
If there are any errors, check CloudWatch for details.
If you do not like the schema that AWS Glue has inferred / discovered for you, you can edit the schema as well.
Next, go to Amazon Athena and find this database and the table.
In all probability, it will look something like this
Note that this table still has nested JSONs as various columns.
You could store the files in this format as well, but we recommend transforming these files from the JSON format they are currently in into Parquet format.
If you convert the files to Parquet format, storage is optimized a lot AND, on top of that, Athena queries become infinitely easier to write (much like regular SQL, rather than running the UNNEST commands that are typical of PrestoDB).
One HUGE benefit of using Amazon Athena is that you only pay for the queries you run 🙂 We LOVE it.
Keep in mind that you want to store call logs for an extended period of time because you want to perform historical analysis / reporting. Meanwhile, you want the latest (maybe the last 15-30 mins) data to perform “real time” reporting as well.
After you have data in the flatfiles folder, you can use AWS Glue to catalog the data and transform it into Parquet format inside a separate folder. In this case, the AWS Glue job performs the ETL that transforms the data from JSON to Parquet format. However, do keep in mind that Glue jobs cannot be run more frequently than every 5 minutes. So, that’s a limitation you need to be aware of and work around / work with.
Considering that this is not a real time stock trading application, you can get away with processing data every 5 mins.
Great, so your next step would be to (again) use another AWS Glue crawler to crawl through this parquet directory and create a table for you.
You have already created a crawler before, so this shouldn’t be a problem for you.
Go ahead and create a Glue Job to transform your JSON files in the flattened directory to parquet format.
We would advise you to use AWS Glue Studio to visually create and monitor jobs.
Use these parameters to drive the job:
Click on Advanced Properties dropdown and fill in the details. We tend to collocate our work logically in a single bucket, but you can choose to do this however it works for you.
You will notice that the AWS Glue job is ready to be run. You can click on the Script tab to fill in the details.
Keep in mind that this script does a few things:
- It waits for the first AWS Glue crawler to finish running.
- Then it uses relationalize to flatten the nested JSON attributes of the data.
- Then it removes fields with null values.
- Then it removes duplicate records using distinct.
- It also converts column names to lowercase – this follows Athena best practices (Hive conventions).
- Finally, the resulting data is written to the target.
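The heavy lifting in the actual script is done by Glue’s relationalize transform on a Spark DynamicFrame. As a plain-Python illustration of the same idea (this is NOT the Glue script itself, and the record fields are made up), here is roughly what flattening, null-dropping, de-duplication, and lowercasing do to a nested record:

```python
def relationalize_record(record, parent=""):
    """Flatten nested dicts into dotted column names,
    similar in spirit to Glue's relationalize transform."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(relationalize_record(value, name))
        else:
            flat[name] = value
    return flat

def clean(records):
    """Drop null fields, lowercase column names, de-duplicate."""
    seen, out = set(), []
    for r in records:
        flat = {
            k.lower(): v
            for k, v in relationalize_record(r).items()
            if v is not None
        }
        fingerprint = tuple(sorted(flat.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            out.append(flat)
    return out

records = [
    {"ContactId": "abc", "Agent": {"Username": "jdoe"}, "Queue": None},
    {"ContactId": "abc", "Agent": {"Username": "jdoe"}, "Queue": None},
]
print(clean(records))
# [{'contactid': 'abc', 'agent.username': 'jdoe'}]
```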
Go ahead and run this first crawler, then run the job to try it out. If there are errors, you will find them in the AWS CloudWatch error logs (see link).
If there are no errors, you will find the CTR files now available in the parquet/ctr folder you had created.
Next up, you can create a trigger and schedule it to run every 5-10 minutes.
This trigger is going to kick off the parquet transformation job you had created, which in turn kicks off the AWS Glue crawler.
Keep in mind that the script in that transformation job starts the source crawler itself, at this line: glue.start_crawler(Name=sourcecrawler). The sourcecrawler is what you fed the script as a parameter. In our case, it was “nengage-ccp-crawler”.
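Inside the job script, starting the source crawler and waiting for it to finish can be sketched like this. The generic wait_for helper is our own; start_crawler and get_crawler are the standard boto3 Glue client calls, and the client itself is passed in from outside:

```python
import time

def wait_for(poll, is_done, interval=10, timeout=900):
    """Call poll() until is_done(result) is true, or give up
    after `timeout` seconds."""
    deadline = time.time() + timeout
    while True:
        result = poll()
        if is_done(result):
            return result
        if time.time() >= deadline:
            raise TimeoutError("gave up waiting for the crawler")
        time.sleep(interval)

def start_and_wait_for_crawler(glue, crawler_name):
    """Start a Glue crawler and block until it is READY again.
    `glue` is a boto3 Glue client; `crawler_name` is the
    source-crawler name the job receives as a parameter."""
    glue.start_crawler(Name=crawler_name)
    return wait_for(
        lambda: glue.get_crawler(Name=crawler_name)["Crawler"]["State"],
        lambda state: state == "READY",
    )
```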
Schedule it every 5-10 minutes, for example: 0/10 13-22 ? * MON-FRI *
Keep in mind that the cron schedule is in UTC (hence the 13-22).
Verify that your crawler is running
This crawler, as you recall, crawls the flatfiles folder in your S3 bucket to discover changes to schema (if any).
Typically you would see this output if you haven’t made any changes
After running this crawler, the transform job next updates the parquet files in your destination path – /parquet/ctr within the S3 bucket you had created.
After this, it kicks off the results crawler – which in our case was the “nengage-ccp-parquet-crawler”. This crawler, if you recall, crawls through the parquet files created by AWS Glue job and infers schema for the parquet formatted CTR data.
So, if there were any updates to the schema, you would see the updates reflected in the parquet table as well.
So now, you have:
- Captured the CTR logs of your Amazon Connect contact center instance via an Amazon Kinesis data stream (in real time).
- Created an Amazon Kinesis Firehose delivery stream that takes the Kinesis stream of CTR records and delivers the files to an Amazon S3 bucket of your choice.
- Created a Lambda function that takes the files from the deep folder structure in the Amazon S3 bucket and moves them to a flat directory called flatfiles.
- Created an AWS Glue crawler to take those files, infer the schema, and create a table for you. This table is accessible to you in Amazon Athena. In our case, that table was called ccp_flatfiles.
- Created an AWS Glue job that catalogs the data and transforms it into Parquet format, placing the results inside a separate folder – e.g. parquet/ctr.
- Created another Glue crawler to take the Parquet files, infer the schema, and create a table for you. This table is also accessible to you in Amazon Athena.