AWS Athena offers something quite fun: the opportunity to run SQL queries against data stored in S3 buckets as if they were SQL tables. Many things are represented as formatted text files: the description of an object, say, or a serialised event. Sometimes we want to consume a feed of them in a format that we don't control, in which case we need an ETL (Extract, Transform and Load) process that takes input in a file format and pushes it into a database.
I bet every single business in the world has ETL jobs. And because they are so crucial to business operations, they are a constant risk. They're changing all the time, often bolting more and more boilerplate onto the monolith. What if we introduce a bug? What if they fall over at our busiest period? What if someone starts sending us junk - how are we going to roll back the bad data?
With these worries, treating the input files as the data store and querying them directly would make life simpler. Crappy data? Delete the files. Schema change? Try a different query. Volume through the roof every Thursday? Hey, Amazon can handle saving to S3.
In the music business, our ETL sources are the industry-standard XML files called DDEX messages, which describe a release: the territories it is licensed in, the formats it is available in, the special offers that apply, as well as the track names, performers and album artwork. They are pretty large documents, designed by committee, and we need to ingest one for every release in the catalogue. And every update. And every re-release. And every release that is licensed differently across territories. So we copy, parse and write them to the database so we know what's in our catalogue.
What if we could just put them in an S3 bucket, and do the queries against them as if they were a database? They would be exactly as the supplier sent them, a perfect golden source.
So I tried doing that, by setting up an Athena database to query them. The results were a bit meh. For a start, Athena has no support for XML files. Well, that's a drag. It's a popular format. As a workaround I wrote a simple Lambda function that would serialise the XML into JSON. It's here for the curious. But even with a JSON file Athena doesn't make it easy.
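The conversion itself doesn't need much. A minimal sketch of that kind of XML-to-JSON step might look like this (the function name and the sample document are illustrative, not taken from the actual Lambda):

```python
# Sketch of an XML-to-JSON conversion step, as a plain Python function.
# Names here are illustrative; the real Lambda in the post is linked above.
import json
import xml.etree.ElementTree as ET

def xml_to_dict(element):
    """Recursively convert an XML element into a plain dict."""
    children = list(element)
    if not children:
        return element.text  # leaf node: just the text content
    result = {}
    for child in children:
        value = xml_to_dict(child)
        # Repeated tags collapse into a list; single tags stay scalar/dict.
        if child.tag in result:
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

root = ET.fromstring("<MessageSender><PartyId>P123</PartyId></MessageSender>")
doc = {root.tag: xml_to_dict(root)}
print(json.dumps(doc))
```

One wrinkle worth knowing: Athena's JSON SerDe expects one complete JSON object per line, so each converted document should be written out as a single line rather than pretty-printed.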
So we set up a new table by clicking on the teeny-tiny new table link, and get this dialogue:
(note the Location format is s3://bucket-name/ - don't forget the trailing slash)
Then it gets kind of crappy. After choosing the JSON format, we get to specify the fields for the table:
You'll see that the Column Type is limited to primitive data types, which is all very well, but the key benefit of representations like JSON and XML is that they can capture complex types. You can create tables with complex types by using the struct column type, but each level of nesting in the JSON requires a new struct, so for just two levels of nesting we get:
CREATE EXTERNAL TABLE release_messages_3 (
  MessageSender struct<`PartyId`:string,
                       `RecordingIds`:struct<`ISRC`:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://ddex-samples-as-json/'
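For reference, a JSON document matching that schema would look something like this (the values here are made up for illustration):

```json
{"MessageSender": {"PartyId": "PADPIDA001", "RecordingIds": {"ISRC": "USZZZ9900001"}}}
```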
And, hey presto we can run a SQL query against this table:
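A query against the nested fields would look something along these lines, using the table and field names from the CREATE TABLE above and Athena's dot notation for struct members:

```sql
SELECT MessageSender.PartyId,
       MessageSender.RecordingIds.ISRC
FROM release_messages_3;
```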
But for a large, complex document like a DDEX message, this is impractical. Even to get this example to work I had to change the Lambda to extract these ids from the file, transforming the document and defeating the attraction of using Athena in the first place.
That's not, of course, to say that Athena is no good. It's just to say that it doesn't work for this use case. And it's not to say that this use case won't work in the future, or even soon. Amazon can move pretty fast between versions. It's just to say that Athena is young and feels like it's not quite at v1.0 level yet.
There are other little glitches. I can't see how to delete a table, and if I navigate away from a query without saving it, it's gone. If Athena proves useful, I bet Amazon will make the UX slicker and the import more intelligent. I'm going to keep an eye on Athena; I think it might be going places.