Solving AWS Athena + JDBC Simba Driver Connection Issue

deepak agrawal
3 min readMar 10, 2022

--

What is AWS Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

AWS Athena (https://aws.amazon.com/athena/) is getting popular day by day because

1. It is server less which means less maintenance and management.

2. You only pay for the Scan , S3 Storage and Standard network cost (https://aws.amazon.com/athena/pricing/).

3. It can scan petabytes of data and most of all RDBMS concept works here. So any SQL developer can immediately start using it for warehousing or data lake purposes.

AWS Athena has both ODBC and JDBC drivers available but in this blog, will only focus on problem we faced using JDBC drivers.

Officially, AWS website recommends to use Simba driver for JDBC connection. You can refer AWS official website for more details (https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html)

Problem

Now, let me explain the problem we faced within our team. We are using Athena Simba driver version 42_2.0.13 and upon concurrent request to Athena, we observed the following:

  1. After firing a few mix of SELECT, CTAS (CREATE TABLE AS) and ALTER PARTITION queries, Athena server was not responding to the active connection and the thread was getting timeOutException. For this we fired around 1200–1300 queries.
  2. After firing 51 CTAS (CREATE TABLE AS) queries, For the 51st query, Athena Server was not responding back to Application server.

Debug Strategy

At this point of time, we had no clue. So we decided to do a few things

  1. Updating the Simba JDBC driver to latest version (42_2.0.27.1001). It did not help;
  2. Downgrading the Simba JDBC driver to previous version (42_2.0.5). It did not help; and
  3. Doing the Thread dump analysis ( JSTACK -<JAVA PID>)
Thread dump analysis using https://fastthread.io/ with validate query enabled upon borrow

By Looking at thread dump, we noticed that we enabled validation of connection upon borrowing from the connection from Pool. So we decided to temporary disable it and try.

Thread dump analysis using https://fastthread.io/ with validate query disabled upon borrow
Thread dump analysis using https://fastthread.io/ with validate query disabled upon borrow

This thread dump gave us our first clue. If you notice, It is showing call stack trace with AthenaStreamingServiceClient.java. We had read sometime back that Athena Simba drive is supporting Streaming for Result Set (https://s3.amazonaws.com/athena-downloads/drivers/JDBC/SimbaAthenaJDBC-2.0.27.1000/doc/Simba+Athena+JDBC+Connector+Migration+Guide.pdf) starting from driver version 42_2.0.5.

Also we had noticed port 444 (which is by default getting used for Result Set Streaming) is not releasing once the query is completed and there are as many as 50 open connections on this port. Since we were debugging on the Windows box, used netstat -a in command prompt (CMD) to check port usage.

This gave us some level of confidence that Result Set Streaming might be creating some issue. After discussing with AWS support, we found the property key to disable it. It is UseResultsetStreaming=0 which should be part of connection URL string.

Old Connection URL

jdbc:awsathena://athena.us-east-2.amazonaws.com:443;catalog=xxx;schema=yyy;UID=123XYZ;PWD=123XYZ;S3OutputLocation=s3://folder/log/

New Connection URL to disable result set streaming

jdbc:awsathena://athena.us-east-2.amazonaws.com:443;UseResultsetStreaming=0;catalog=xxx;schema=yyy;UID=123XYZ;PWD=123XYZ;S3OutputLocation=s3://folder/log/

With the above change, we did our testing and the problem was solved.

Downside of disabling Result Set Streaming

Disabling Result set Streaming will have performance impact of SELECT results if number of rows are beyond 1000. Even if developer is setting fetch size, it will get overridden by the driver and application will end up getting 1000 records in one go which will increase network trips between application and Athena Result Engine.

Hope this blog will provide a quick insight to all of you.

--

--

deepak agrawal
deepak agrawal

Written by deepak agrawal

Coding is an art, art to imagine Data Structure holding your data and solving problems. I am learning to become an artist.

Responses (3)