CSV Imports into Google Cloud BigQuery

If you have a large Postgres database, consider trying Google Cloud BigQuery for OLAP-style reporting queries.

database> COPY (<query_to_denomralize_report_data>) TO '<dest_filename>' DELIMITER ',' CSV HEADER;

Then upload it to GCS:

$ gsutil cp <dest_filename> gs://<dest_bucket>/<dest_filename>

Then, follow Loading Data into BigQuery (with Google Cloud Storage). For this to work, you’ll have to define the schema and skip the first leading row.

My most recent load job took 4 seconds, loading a very large CSV file. Query outputs can be saved to Google Sheets or other BigQuery tables.

For those of you who are looking for something similar on AWS, check out Amazon Athena.

Book Review: Designing Data-Intensive Applications

book-cover

Some days I feel that I’ve returned to my old DBA role, well… sort of. A lot of my day to day work revolves around a vast amount of data and applications for data storage and retrieval. I haven’t been a DBA by trade since 2011, nearly an eternity in the tech industry.

To me, Designing Data-Intensive Applications was a wonderful survey of “What’s changed?” as well as “What’s still useful?” regarding all things data. I found Designing had practical, implementable material, especially around newer document databases.

This book is a good balance of theory and practice. I particularly enjoyed the breakdown of concurrency related problems and adding concrete names to classes of issues I’ve seen in practice (e.g., phantom reads, dirty reads, etc).

The section about distributed systems theory was well done but it was a definite lull in the rest of the book’s flow.

The book’s last section and last chapter were both strong. I would recommend this book to all of my colleagues.

Looking at PostGIS intersects queries

I’ve been looking hard at datastores recently, in particular PostgreSQL + PostGIS. In school, I did a few GIS related things in class and I assumed it was magic. It turns out there is no magic.

At work, we have an application that accepts polygons and returns objects whose location are within that polygon. In GIS-speak this is often called an Area of Interest or AOI and they’re often represented as a few data types: GeoJSON, Well-known Text and Well-known Binary (WKT/WKB).

So what happens when the request is accepted and how does PostGIS make this all work?

Here’s a graphical representation of my GeoJSON (the ‘raw’ link should show the GeoJSON document):


Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

view raw

geojson.json

hosted with ❤ by GitHub

And the same as WKT:


POLYGON ((-88.35548400878906 36.57362961247927, -88.24562072753906 36.57362961247927, -88.24562072753906 36.630406976495614, -88.35548400878906 36.630406976495614, -88.35548400878906 36.57362961247927))

view raw

map.wkt

hosted with ❤ by GitHub

In the PostGIS world, geospatial data is stored as a geometry type.

Here’s a query returning WKT->Geometry:


db # SELECT ST_GeometryFromText('POLYGON ((-88.35548400878906 36.57362961247927, -88.24562072753906 36.57362961247927, -88.24562072753906 36.630406976495614, -88.35548400878906 36.630406976495614, -88.35548400878906 36.57362961247927))
');
-[ RECORD 1 ]——-+——————————————————————————————————————————————————————————————-
st_geometryfromtext | 0103000000010000000500000000000040C01656C0CDCEF4B16C49424000000040B80F56C0CDCEF4B16C49424000000040B80F56C0059C012DB150424000000040C01656C0059C012DB150424000000040C01656C0CDCEF4B16C494240

view raw

output

hosted with ❤ by GitHub

The PostGIS geometry datatype is an lwgeom struct.

So, to find all of the objects in a given AOI, the SQL query would look something like this (PostGIS FAQ):


SELECT id FROM objects WHERE geom && ST_GeometryFromText('POLYGON ((-88.35548400878906 36.57362961247927, -88.24562072753906 36.57362961247927, -88.24562072753906 36.630406976495614, -88.35548400878906 36.630406976495614, -88.35548400878906 36.57362961247927))')

OK, but what does the && operator do? Thanks to this awesome dba.stackexchange answer, we finally get some answers! In a nutshell:

  • && is short for geometry_overlaps
  • geometry_overlaps passes through several layers of abstraction (the answer did not show them all)
  • in the end, floating point comparisons see if the object in the database’s polygon is within the requested polygon