Solving AWS Athena + JDBC Simba Driver Connection Issue
What is AWS Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
AWS Athena (https://aws.amazon.com/athena/) is getting popular day by day because
1. It is server less which means less maintenance and management.
2. You only pay for the Scan , S3 Storage and Standard network cost (https://aws.amazon.com/athena/pricing/).
3. It can scan petabytes of data and most of all RDBMS concept works here. So any SQL developer can immediately start using it for warehousing or data lake purposes.
AWS Athena has both ODBC and JDBC drivers available but in this blog, will only focus on problem we faced using JDBC drivers.
Officially, AWS website recommends to use Simba driver for JDBC connection. You can refer AWS official website for more details (https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html)
Problem
Now, let me explain the problem we faced within our team. We are using Athena Simba driver version 42_2.0.13 and upon concurrent request to Athena, we observed the following:
- After firing a few mix of SELECT, CTAS (CREATE TABLE AS) and ALTER PARTITION queries, Athena server was not responding to the active connection and the thread was getting timeOutException. For this we fired around 1200–1300 queries.
- After firing 51 CTAS (CREATE TABLE AS) queries, For the 51st query, Athena Server was not responding back to Application server.
Debug Strategy
At this point of time, we had no clue. So we decided to do a few things
- Updating the Simba JDBC driver to latest version (42_2.0.27.1001). It did not help;
- Downgrading the Simba JDBC driver to previous version (42_2.0.5). It did not help; and
- Doing the Thread dump analysis ( JSTACK -<JAVA PID>)
By Looking at thread dump, we noticed that we enabled validation of connection upon borrowing from the connection from Pool. So we decided to temporary disable it and try.
This thread dump gave us our first clue. If you notice, It is showing call stack trace with AthenaStreamingServiceClient.java. We had read sometime back that Athena Simba drive is supporting Streaming for Result Set (https://s3.amazonaws.com/athena-downloads/drivers/JDBC/SimbaAthenaJDBC-2.0.27.1000/doc/Simba+Athena+JDBC+Connector+Migration+Guide.pdf) starting from driver version 42_2.0.5.
Also we had noticed port 444 (which is by default getting used for Result Set Streaming) is not releasing once the query is completed and there are as many as 50 open connections on this port. Since we were debugging on the Windows box, used netstat -a in command prompt (CMD) to check port usage.
This gave us some level of confidence that Result Set Streaming might be creating some issue. After discussing with AWS support, we found the property key to disable it. It is UseResultsetStreaming=0 which should be part of connection URL string.
Old Connection URL
jdbc:awsathena://athena.us-east-2.amazonaws.com:443;catalog=xxx;schema=yyy;UID=123XYZ;PWD=123XYZ;S3OutputLocation=s3://folder/log/
New Connection URL to disable result set streaming
jdbc:awsathena://athena.us-east-2.amazonaws.com:443;UseResultsetStreaming=0;catalog=xxx;schema=yyy;UID=123XYZ;PWD=123XYZ;S3OutputLocation=s3://folder/log/
With the above change, we did our testing and the problem was solved.
Downside of disabling Result Set Streaming
Disabling Result set Streaming will have performance impact of SELECT results if number of rows are beyond 1000. Even if developer is setting fetch size, it will get overridden by the driver and application will end up getting 1000 records in one go which will increase network trips between application and Athena Result Engine.
Hope this blog will provide a quick insight to all of you.