Popular Posts

SQLblog.com - The SQL Server blog spot on the web

Sunday, December 17, 2017

Query big data, stored in a csv or parquet files, from any stored procedure

SQLblog.com - The SQL Server blog spot on the web
Query big data, stored in a csv or parquet files, from any stored procedure

Every cloud provider these days have a serverless interactive query service that uses standard SQL for data analysis.

As for the biggest cloud providers, we have Azure Data Lake analytics, Google BigQuery and Amazon Athena.

Due to the fact that my company is using only Amazon cloud, I have some experience with Athena service and I must say that it is awesome.

I will show you today how you can use Management Studio or any stored procedure to query the data, stored in a csv file, located on S3 storage. I am using CSV file format as an example here, columnar PARQUET gives much better performance.

I am going to:

1. Put a simple CSV file on S3 storage

2. Create External table in Athena service over the data file bucket

        3. Create linked server to Athena inside SQL Server  

        4. Use OPENQUERY to query the data.  

Athena service is built on the top of Presto, distributed SQL engine and also uses Apache Hive to create, alter and drop tables. You can run ANSI SQL statements in the Athena query editor, launching it from the AWS web services UI. You can use complex joins, window functions and many other great SQL language features. Using Athena eliminates need for ETL because it projects your schema on the data files at the time of the query.

Let’s create database in Athena query editor.

image

As a next step I will put this csv file on S3. By the way, Athena supports JSON format, tsv, csv, PARQUET and AVRO formats.

Uploading the below file to S3 bucket (don’t put columns header in file):

image

image

Going back to Athena to create external table over the S3 folder.  You can later add more files to the same folder and your queries will return the new data immidiately.

image

Now I can query the data:

image

image

As a next step I will set up a linked server from my SQL Server instance because I would like to offload the big data querying to Athena. Of course, I am using in this example tiny data file but in real life we are querying sometimes 300GB of data files in the single query and it takes a few seconds.

Athena has ODBC driver, I will install it on SQL Server machine ( AWS EC2 instance for this example).

Here is an installation link: https://s3.amazonaws.com/athena-downloads/drivers/ODBC/Windows/Simba+Athena+1.0+64-bit.msi

Setting up the ODBC connection. Important, click on the Authentication Option and fill in AccessKey and SecretKey that have permissions to access S3 bucket. S3 output location below will hold csv files with results from your queries. Remember to clean the output files from time to time.

image

What is left is to set up Linked Server inside Management Studio using OLEDB provider for ODBC.

EXEC master.dbo.sp_addlinkedserver @server = N'DWH_ATHENA', @srvproduct=N'', @provider=N'MSDASQL', @datasrc=N'DWH_ATHENA'

GO

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'DWH_ATHENA',@useself=N'False',@locallogin=NULL,@rmtuser=N'*******',@rmtpassword='*********'

GO

Replace @rmtuser and @rmtpassword with AWS access key and secret key and now we can query the data files from any script or stored procedure.

There is one very important thing that you need to know. Regular SQL Server ODBC destinations querying behavior of is to send “select *” to linked server and do filtering inside SQL Server. This is very bad for us since we wanted to offload all work to Athena and we do not want to receive all data.  The way to overcome this is to use OPENQUERY.

Here is example of the query that is using a linked server. The remote query has omitted all filtering and receiving ALL columns from the remote table and filter is being applied later on, inside the “Filter” step.

image

The same query that is using OPENQUERY instead of linked server:

image

Result.

image

 

Isn’t it wonderful to be able to keep the row data in files and query them with minimal effort from the SQL Server side?

Yours,

Maria

No comments: