Implementing SQL API similar to GraphQL
So they say REST is dead and inefficient. Let’s start a revolution and invent a new query language for your API! If you follow tech trends, you could not have missed the new technology from Facebook called GraphQL.
For those wondering what GraphQL is, here is the description from its site: it is a query language for APIs and a runtime for fulfilling those queries with your existing data. GraphQL provides a complete and understandable description of the data in your API, gives clients the power to ask for exactly what they need and nothing more, makes it easier to evolve APIs over time, and enables powerful developer tools.
GraphQL is incredibly powerful: so powerful, in fact, that corporations such as Facebook use it to drive their large, complex social systems for billions of users.
It is so powerful that it requires a “new way of thinking” and a horde of dependencies.
Now stop and think: do you build complex systems for billions of users? If the answer is yes, you can probably stop reading, but I have a feeling that most of you will say no. I love simplicity (who doesn’t?), and my experience shows that if something works for a corporation as large and complex as Facebook, the rest of us should probably do the reverse.
GraphQL is not a data query framework; it is a patch for organizational and technical issues that happen at large scale. The same story is behind the microservices pattern, and look how the sentiment of the crowd has changed from “let’s use it for everything” to “screw it, I’d better stick with my monolithic LAMP stack”.
So, let’s assume you are an average-size company, and you want a GraphQL-like, efficient, flexible, and secure way to share data inside and outside of your organization. Let’s open the GraphQL site and read how it positions itself: “get many resources in a single request”, “ask for what you need”, “describe what’s possible with a type system”, etc. Wait, did I just read the description of good old SQL? As usual in the IT world, answers to new challenges can be found in our past.
Raise a hand if you would prefer a REST or GraphQL API over SQL access, which offers ultimate flexibility, a well-known and powerful 30-year-old language, and tons of existing libraries. Now imagine how happy your customers would be if they could pipe your API directly into internal reports or dashboards like charts.io.
I do not see too many hands.
If you have read this far, you probably have a lot of valid concerns: how do I manage access to the data, what about security, and probably many more. So let’s try to resolve them together.
For simplicity I chose PostgreSQL because, well, it’s the best general-use SQL database you can find; it is flexible and powerful enough to be tuned for almost any use case.
Limiting access to the data
Each data consumer, whether a service inside your organization or an external customer, should see only the data they have access to. PostgreSQL has a powerful role-based security system and even allows you to specify per-row access.
Let’s start by defining roles:
```sql
-- Note that it is the same as CREATE USER
CREATE ROLE <user>;

-- Each role can have its own password
ALTER USER <user> WITH PASSWORD '<pass>';

-- Allow connecting to our database
GRANT CONNECT ON DATABASE <db> TO <user>;

-- Allow seeing the tables in the `public` schema;
-- you may want to create multiple schemas for various use cases
GRANT USAGE ON SCHEMA public TO <user>;
```
When we talk about security, it is better to always whitelist things rather than blacklist them. Thankfully, by default new roles do not have access to anything, so let’s grant a few permissions:
```sql
-- For tables that can only be read
GRANT SELECT ON <table> TO <user>;

-- For full read-write support
GRANT SELECT, INSERT, UPDATE, DELETE ON <table> TO <user>;
```
So now the user has access only to the tables we granted, but they can still query or update ALL the rows in those tables, which is probably not what we want.
PostgreSQL 9.5 added support for row-level security policies, which means that you can have a large global table, and individual users will see, and update, only the rows allowed by a policy.
Let’s assume you have an `accounts` table, which contains customer logins, and a `data` table with the actual data, which has an `account_id` column. We also define a Postgres role for each `account_id` (the role has the same name as the id). `current_role` is a special SQL function that returns the currently logged-in user.
```sql
-- First we need to enable this feature
ALTER TABLE data ENABLE ROW LEVEL SECURITY;

-- Allow selecting only rows the user should have access to
CREATE POLICY per_user_accounts_select ON data FOR SELECT
    USING (account_id = current_role);

-- Allow updating only rows assigned to our account_id
CREATE POLICY per_user_accounts_update ON data FOR UPDATE
    USING (account_id = current_role);
```
You can use more complex security checks; the `USING` clause may contain almost any SQL expression (except aggregate or window functions).
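For example, here is a sketch of a richer policy that makes rows visible to a whole team; the `team_members` table and its columns are hypothetical, not part of the schema above:

```sql
-- Hypothetical team_members(team_id, account_id, member_role) table:
-- a row becomes visible to every member of the owning account's team
CREATE POLICY per_team_accounts ON data FOR SELECT
    USING (account_id IN (
        SELECT account_id
        FROM team_members
        WHERE member_role = current_role
    ));
```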
Using functions for modifying data
In the real world, allowing API consumers to directly `UPDATE` or `INSERT` can be a bad idea, especially when business logic is involved. It may seem natural to put that logic in server code, but we can actually do it in the database as well.
My experience shows that most of us underestimate database capabilities and use the database as a dumb query engine. In fact, modern databases provide a fully functional programming environment, with unique capabilities and the fastest possible access to the data.
So, instead of letting our consumers directly update the data, let’s write a function that will be stored in our database and used to modify the data.
```sql
CREATE FUNCTION update_data (ts timestamp, val integer)
RETURNS void
AS $$
# Assume that updating data is available only to enterprise customers
role = plpy.execute("SELECT current_role AS role", 1)[0]['role']
plan = plpy.prepare(
    "SELECT is_enterprise FROM accounts WHERE account_id = $1", ["text"])
r = plpy.execute(plan, [role], 1)[0]
if r['is_enterprise']:
    plan = plpy.prepare(
        "UPDATE data SET val = $1 WHERE ts = $2", ["integer", "timestamp"])
    plpy.execute(plan, [val, ts])
else:
    plpy.info("Unauthorized access")
$$ LANGUAGE plpythonu;
```
Python? Yes! PostgreSQL allows you to use a variety of languages for writing stored procedures, and while I personally prefer the native PL/pgSQL dialect, using Java, R, or Python is totally OK. Want to add rate limiting or auditing? Not a problem!
Notice that we did not specify `account_id` while updating the data, because it is handled automatically. Once per-row security policies are written, they work everywhere, so we can be sure the user touches only their own subset of the data.
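To illustrate how easily auditing can be bolted on, here is a minimal sketch; the `api_audit` table and `log_call` function are hypothetical names, not part of the setup above:

```sql
-- Hypothetical audit table: record who called what, and when
CREATE TABLE api_audit (
    called_at timestamp DEFAULT now(),
    role_name text,
    action    text
);

CREATE FUNCTION log_call (action_name text)
RETURNS void
AS $$
    INSERT INTO api_audit (role_name, action)
    VALUES (current_role, action_name);
$$ LANGUAGE sql;
```

Calling `SELECT log_call('update_data')` at the top of a function like `update_data` then leaves a full trail of who changed what, and when.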
Dealing with slow queries
When you give users such a powerful tool, it is just a matter of time before somebody screws it up, writes an endless query, and hangs your database. In my view, the easiest way to avoid this is simply to disallow slow queries altogether.
The statement below will automatically abort any query issued by the given role that runs for longer than 10 seconds (the value is in milliseconds):

```sql
ALTER ROLE <user> SET statement_timeout = 10000;
```
Prioritizing memory
PostgreSQL does have some tunable per-client memory parameters, particularly [work_mem](http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-WORK-MEM) and [maintenance_work_mem](http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM). You can set these to conservative, low values in `postgresql.conf`, then use the `SET` command to assign higher values to them for a particular backend, e.g. `SET work_mem = '100MB';`.
You can set different default values for `work_mem` and `maintenance_work_mem` using per-user GUC variables. For example:

```sql
ALTER USER <user> SET work_mem = '50MB';
```
How do I access it from the web?
OK, there is a chance that I have convinced you that exposing SQL to server-side apps can be a good idea, but what about exposing it to the web?
Let’s do it the same way GraphQL does: we will create a simple `/sql` endpoint which, unsurprisingly, accepts SQL statements.
Before executing the query on the endpoint side, you just need to set the proper role:

```sql
SET ROLE <account_id>;
```
After the role is set, all the security policies are automatically applied to the running query.
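Here is a minimal sketch of what such an endpoint might look like, assuming Flask and psycopg2; `get_account_id` is a hypothetical helper that resolves the caller’s account from an auth token, and the connecting database user must be a member of the customer roles for `SET ROLE` to succeed:

```python
import psycopg2
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/sql', methods=['POST'])
def sql():
    # get_account_id is a hypothetical helper; plug in your own auth logic
    account_id = get_account_id(request)
    conn = psycopg2.connect('dbname=<db> user=<api_user>')
    try:
        with conn.cursor() as cur:
            # Switch to the customer's role so every row-level policy applies;
            # SET ROLE accepts a string literal, so parameterization is safe
            cur.execute('SET ROLE %s', (account_id,))
            cur.execute(request.get_data(as_text=True))
            return jsonify({'rows': cur.fetchall()})
    finally:
        conn.close()
```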
Generating SQL and parsing the response on the client side should not be a problem; I will leave that to the reader’s imagination.
SQL is so complex
SQL is like Excel: it is so common that even most managers know the basics and can write a simple `SELECT *` query and pipe it into charts.io or a similar dashboard. In addition, most existing tools offer column auto-discovery and suggestions, or even visual query builders.
You may say that you can’t teach your front-end engineers to write JOINs, let alone window functions or efficient index usage. And you actually do not need to!
SQL allows you to create VIEWs, which are a kind of black box. You can define a query of any complexity and assign it to a view; for the end user, it acts just like a table.
Anywhere you need custom logic, like aggregations or joins, just create a new view. It’s that simple. Moreover, you can limit user access to the views alone, without giving access to the original tables, as the sketch below shows.
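As a sketch, built on the `accounts` and `data` tables from earlier (the view name and columns are illustrative):

```sql
-- Hide a join and an aggregation behind a view
CREATE VIEW account_stats AS
SELECT a.login,
       count(d.*) AS rows_total,
       max(d.ts)  AS last_seen
FROM accounts a
LEFT JOIN data d ON d.account_id = a.account_id
GROUP BY a.login;

-- Expose only the view, not the underlying tables
GRANT SELECT ON account_stats TO <user>;
```

Because a view runs with its owner’s permissions by default, the consumer never needs any rights on `accounts` or `data` themselves.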
Querying from multiple data sources
GraphQL can hide complex logic behind a query, and you can mash up data from multiple data sources.
It’s a question of the architecture you are building. In most cases, when you query data from a third-party API, you are probably caching it somewhere, so why not in the database? You should also consider what is more costly for you: putting all the data into a single database, or maintaining dozens of separate ones. It’s all up to you, but let me tell you a story about my latest project, where I handled this problem using only the database.
So, I needed to gather information from multiple API sources like GitHub, Clearbit, email validation services, etc., mash it up, and expose it to the user. Querying third-party services on a user’s request is almost always a bad idea; the chance that something goes wrong is too high. We probably want to cache the data, update it periodically, and show the user only the cached version.
I have a worker that periodically fetches raw data from these APIs and puts it into the database. Next, there are views that extract data from the JSON (yes, you can query JSON in Postgres!) and functions that normalize it:
```sql
CREATE VIEW github_meta_ex AS
SELECT github_meta.login,
       format_company(payload ->> 'company') AS company,
       (payload ->> 'followers')::numeric AS followers
       ...
FROM github_meta;

-- And similar views for each service
```
After the data is fetched, I have a final MATERIALIZED VIEW which mashes all the data together. Materialized in this case means that the view takes a snapshot of the queried data and stores it in its own table. No matter how complex your queries are, and no matter how much JSON processing you do, for the end user it will be as fast as a simple static table.
```sql
CREATE MATERIALIZED VIEW meta_ex AS
SELECT g.login,
       g.email,
       COALESCE(g.company, c.employment) AS company
       ...
FROM github_meta_ex AS g
LEFT JOIN clearbit_meta_ex AS c ON g.email = c.email
...
```
And periodically, usually when one of the underlying tables has been updated, you just call `REFRESH MATERIALIZED VIEW CONCURRENTLY meta_ex`, which rebuilds the view’s data without blocking readers (the plain, non-concurrent form takes an exclusive lock for the duration of the refresh).
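One caveat, from the PostgreSQL docs: the CONCURRENTLY form requires a unique index on the materialized view. Assuming `login` is unique in `meta_ex`, the setup might look like this:

```sql
-- CONCURRENTLY needs a unique index on the materialized view
CREATE UNIQUE INDEX meta_ex_login ON meta_ex (login);

REFRESH MATERIALIZED VIEW CONCURRENTLY meta_ex;
```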
Final words
GraphQL is an awesome piece of technology that addresses a lot of issues happening in modern software development, but most likely they are not the issues you experience in your day-to-day work.
Modern SQL can give you enough flexibility to build secure and efficient access to your data, and you can implement it RIGHT NOW, without having to decide whether you should rewrite half of your application from scratch to support a “modern” technology (which is most likely just a bridge to something newer coming in the next few years).
I am happily using such a SQL API in production, and my customers could not be happier; it just works for both sides.