Friday, March 3, 2017

CONNECT BY is dead, long live CTE! In MariaDB 10.2!

Yes, you got that right, the old CONNECT BY as used for recursive SQL with Oracle has been replaced by Common Table Expressions, or the WITH statement, in SQL:1999, and this is now also available in MariaDB 10.2. The SQL WITH construct, using Common Table Expressions or CTEs, is useful for other things than just recursive queries, but recursion is the one feature that WITH enables that was previously very hard to do without some procedural code; the non-recursive uses of Common Table Expressions could previously mostly be replaced by temporary tables.
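As a side note on that last point, here is a minimal sketch of a non-recursive CTE (the tables and columns here are hypothetical, just for illustration); before CTEs, you would typically have materialized the aggregate in a temporary table first:
WITH order_totals AS (
  SELECT customer_id, SUM(amount) AS total
  FROM orders
  GROUP BY customer_id)
SELECT c.customer_name, t.total
FROM customers c JOIN order_totals t ON c.customer_id = t.customer_id;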

This blog post will explain what recursive SQL is all about and why this is useful, and I will show some examples of both CONNECT BY and how the same SQL is written using the WITH clause.

The most common example of recursive SQL is probably a parts explosion, where we have a table of parts of some component and each part is either a main, top-level, part or is a part of another part. For example a car with an engine, where the engine consists of pistons, cylinders and a camshaft, and where the latter also includes some camshaft bearings. I think you get the basic idea here. To query this data and create a list of the components that make up some other component, you need to visit the data recursively, i.e. each row, except the first row fetched, is evaluated using conditions from some other row already fetched.

Now, let's look at some data first. I assume we have two tables here: one table that contains information on the individual parts and one table that contains the hierarchy of the parts, called components. Like this:
CREATE TABLE parts(part_id INTEGER NOT NULL PRIMARY KEY,
  part_name VARCHAR(60) NOT NULL);

CREATE TABLE components(comp_id INTEGER NOT NULL PRIMARY KEY,
  comp_name VARCHAR(60),
  comp_count INTEGER NOT NULL,
  comp_part INTEGER NOT NULL,
  comp_partof INTEGER,
  FOREIGN KEY(comp_part) REFERENCES parts(part_id));
ALTER TABLE components ADD FOREIGN KEY(comp_partof) REFERENCES components(comp_id);


The two things to note here are that the components table has a column, comp_partof, that implements the hierarchy and that there is a self-referencing FOREIGN KEY constraint on this table. Given these tables, and assuming that we are a small privately held car-manufacturing company in southern Germany, let's insert some data:
INSERT INTO parts VALUES(1, 'Car');
INSERT INTO parts VALUES(2, 'Bolt');
INSERT INTO parts VALUES(3, 'Nut');
INSERT INTO parts VALUES(4, 'V8 engine');
INSERT INTO parts VALUES(5, '6-cylinder engine');
INSERT INTO parts VALUES(6, '4-cylinder engine');
INSERT INTO parts VALUES(7, 'Cylinder block');
INSERT INTO parts VALUES(8, 'Cylinder');
INSERT INTO parts VALUES(9, 'Piston');
INSERT INTO parts VALUES(10, 'Camshaft');
INSERT INTO parts VALUES(11, 'Camshaft bearings');
INSERT INTO parts VALUES(12, 'Body');
INSERT INTO parts VALUES(13, 'Gearbox');
INSERT INTO parts VALUES(14, 'Chassis');
INSERT INTO parts VALUES(15, 'Rear axle');
INSERT INTO parts VALUES(16, 'Rear brake');
INSERT INTO parts VALUES(17, 'Wheel');
INSERT INTO parts VALUES(18, 'Wheel bolts');

INSERT INTO components VALUES(1, '320', 1, 1, NULL);
INSERT INTO components VALUES(2, NULL, 1, 6, 1);
INSERT INTO components VALUES(3, NULL, 1, 7, 2);
INSERT INTO components VALUES(4, NULL, 4, 8, 3);
INSERT INTO components VALUES(5, NULL, 4, 9, 3);
INSERT INTO components VALUES(6, NULL, 1, 10, 3);
INSERT INTO components VALUES(7, NULL, 3, 11, 6);
INSERT INTO components VALUES(8, NULL, 1, 12, 1);
INSERT INTO components VALUES(9, NULL, 1, 14, 1);
INSERT INTO components VALUES(10, NULL, 1, 15, 9);
INSERT INTO components VALUES(11, NULL, 2, 16, 10);


INSERT INTO components VALUES(12, '323 i', 1, 1, NULL);
INSERT INTO components VALUES(13, NULL, 1, 5, 12);

If you are not into mechanics, let me tell you that there are more parts than this to a car; for example I left out a few critical components, such as the cupholder, the dog that stands on the pickup cargo area and the insulting bumper-sticker, but I think you get the idea. Note that there are two "main" components, the '320' and the '323 i', and that these top-level components are indicated by the comp_partof column being set to NULL.

Now, assume you want to list all the parts that make up a 320. The way this works when using the CONNECT BY syntax is that you compose one single SQL statement and provide a CONNECT BY clause to indicate the relationship. Like this:
SELECT LPAD('-', level, '-')||'>' level_text, comp_count, NVL(comp_name, part_name) name
FROM components c, parts p
WHERE c.comp_part = p.part_id
START WITH c.comp_name = '320'
CONNECT BY PRIOR c.comp_id = c.comp_partof;


Let me explain this a bit, but there is nothing really magic here. We are selecting from the two tables and joining them just as usual. Then we use the START WITH clause to define the top-level component, and the rest of the components are those that have a comp_partof matching the comp_id of the START WITH component, or the comp_id of any other component that has already been fetched.
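To see the recursion at work, here is a minimal sketch in Oracle syntax against the same components table, this time starting from all top-level components and using Oracle's LEVEL pseudocolumn to show each row's depth in the hierarchy:
SELECT LEVEL, comp_id, comp_partof
FROM components
START WITH comp_partof IS NULL
CONNECT BY PRIOR comp_id = comp_partof;
Top-level rows come back with LEVEL 1, their direct parts with LEVEL 2, and so on.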
This way of writing recursive SQL has some advantages: it is relatively compact and easy to understand. The disadvantage is that there are some quirks and limitations to it, and once your queries get more complex, CONNECT BY gets a bit hairy. One sure sign that CONNECT BY is going away, even though I and many others tend to like it because of its ease of use, is that even Oracle, as of Oracle 11g, has also implemented the WITH construct, or Common Table Expressions or CTEs. So, looking at the above statement and how it would work in MariaDB 10.2, this is what it would look like using the WITH construct:
WITH RECURSIVE comp(comp_id, comp_name, comp_partof, comp_count) AS (
  SELECT comp_id, comp_name, comp_partof, comp_count
    FROM components JOIN parts ON comp_part = part_id
    WHERE comp_partof IS NULL AND comp_name = '320'
  UNION ALL
  SELECT c1.comp_id, p.part_name, c1.comp_partof, c1.comp_count
  FROM components c1 JOIN parts p ON c1.comp_part = p.part_id
    JOIN comp c2 ON c1.comp_partof = c2.comp_id)
SELECT comp_count, comp_name FROM comp;


Comparing this CTE version to the CONNECT BY version above, it is a bit more complex, but how it works is actually pretty clear once you look at it carefully. To begin with, the top-level item, or anchor, is the first SELECT in the UNION ALL, and the following components are fetched using the second SELECT. The recursive aspect is then handled by this UNION being run repeatedly until it returns no more rows. As you can see, although this requires more text and more complex SQL to write, it is also a fair bit more flexible. For example, the anchor point is defined by a completely separate SELECT, which means it can be whatever SELECT you want, selecting from any odd table. Secondly, the columns you use and the conditions that define the hierarchy can be as complex as you want. And thirdly, there is also the power of that last SELECT, which in the case above just gets the data from the UNION, but to which you can actually apply any kind of filter, ordering or column selection, as the sketch after the result below shows. The result of the query above is this:
comp_count      comp_name
1               320
1               4-cylinder engine
1               Body
1               Chassis
1               Cylinder block
1               Rear axle
4               Cylinder
4               Piston
1               Camshaft
2               Rear brake
3               Camshaft bearings
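To see some of that flexibility in practice, here is a sketch where only the final SELECT has changed; the recursive part stays exactly the same, but we list only the parts used more than once, ordered by name:
WITH RECURSIVE comp(comp_id, comp_name, comp_partof, comp_count) AS (
  SELECT comp_id, comp_name, comp_partof, comp_count
    FROM components JOIN parts ON comp_part = part_id
    WHERE comp_partof IS NULL AND comp_name = '320'
  UNION ALL
  SELECT c1.comp_id, p.part_name, c1.comp_partof, c1.comp_count
  FROM components c1 JOIN parts p ON c1.comp_part = p.part_id
    JOIN comp c2 ON c1.comp_partof = c2.comp_id)
SELECT comp_count, comp_name FROM comp
WHERE comp_count > 1
ORDER BY comp_name;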

Before I finish this off: the WITH RECURSIVE statement above is somewhat overly complex, and in MariaDB 10.2 you can, for example, skip listing the column names of the recursive table, like this:
WITH RECURSIVE comp AS (
  SELECT comp_id, comp_name, comp_partof, comp_count
    FROM components JOIN parts ON comp_part = part_id
    WHERE comp_partof IS NULL AND comp_name = '320'
  UNION ALL
  SELECT c1.comp_id, p.part_name, c1.comp_partof, c1.comp_count
  FROM components c1 JOIN parts p ON c1.comp_part = p.part_id
    JOIN comp c2 ON c1.comp_partof = c2.comp_id)
SELECT comp_count, comp_name FROM comp;


And although Oracle 11g and up support CTEs, they work a bit differently there. For one thing, the RECURSIVE keyword isn't supported (a query is assumed to be recursive by default), and the way I read the SQL standard, this is actually wrong; for recursive queries you have to use the RECURSIVE keyword. Second, Oracle does require the column list. So in Oracle, you would see something like this:
WITH comp(comp_id, comp_name, comp_partof, comp_count) AS (
  SELECT comp_id, comp_name, comp_partof, comp_count
    FROM components JOIN parts ON comp_part = part_id
    WHERE comp_partof IS NULL AND comp_name = '320'
  UNION ALL
  SELECT c1.comp_id, p.part_name, c1.comp_partof, c1.comp_count
  FROM components c1 JOIN parts p ON c1.comp_part = p.part_id
    JOIN comp c2 ON c1.comp_partof = c2.comp_id)
SELECT comp_count, comp_name FROM comp;

Yes, we are all happily following the same SQL standard. Somewhat...
See the MariaDB Knowledge Base for more information on common table expressions.

Happy SQL'ing
/Karlsson

Tuesday, February 28, 2017

JSON with MariaDB 10.2

JSON is fast becoming the standard format for data interchange and for unstructured data, and MariaDB 10.2 adds a range of JSON functions, even though a JSON datatype isn't implemented yet. There are some reasons why there isn't a JSON datatype, but one is that there are actually not that many advantages to having one, as JSON is a text-based format. This blog post aims to describe JSON and the use cases for it, to describe the MariaDB 10.2 JSON functions and their uses, and to show some other additions to MariaDB 10.2 that are useful for JSON processing.

So to begin with then, why do we need JSON? Or to put it differently, why do we not store all data as JSON? Well, the reason as I see it is that some data we work with really is best treated as schemaless, whereas some other data really should be handled in a more strict way, in a schema. Which means that, in my mind, mixing relational data with unstructured data is what we really want. And using JSON for unstructured data is rather neat, and JSON is even standardized (see json.org).

There are reasons why this hasn't always been so. When the sad old git that is writing this stuff started working in this industry, which I think was during the Harding administration, computers were rare, expensive, handled only by experts (so how I got to work with them is a mystery) and built from lego bricks, meccano and pieces of solid gold (to keep up the price). Also, they were about as powerful as a slide rule, except they were fed with punched cards (and probably powered by steam). Anyway, no one in their right mind would have considered storing pictures of cute felines as something to be on a computer, or actually stuff to be stored in the database. The little that would fit was the really important stuff - like price, amount in stock, customer name, billing address and such - and nothing else. And not only that, stuff that was stored had some kind of value, somehow, which meant it had to follow certain rules (and following rules is something I am good at? I wonder how I ended up in this business. Again). Like, a price had to be a number of some kind, with a value 0 or higher, and some other restrictions. As you see, these were hard and relentless times.

And then time moved on and people started buying things on the internet (whatever the internet is. I think it is some kind of glorified, stylish version of punched cards) and stuff such as Facebook and Google came around. The issue with computer storage was now not how to fit all that nicely structured data in it, but rather: once we have filled that hard drive on your desktop with all the products, customers and transactions from Amazon (I think Amazon has something to do with the Internet, but I am not sure), and a full 17.85% of that drive is now occupied by that old-style structured data, what more do we put in there? Maybe we could put some more information on the items for sale in that database, and some general information on who is buying it? That should fill up that disk nicely, right? Well, yes, but that new data, although related to the structured data I already have, is largely unstructured. Say, for example, you write a review of a product on Amazon late in the morning after a good deal of heavy "partying" (which is not an Internet thing, I think); the contents of that would hardly be considered "structured". If you didn't like the product (which you probably didn't), then the appropriate terms for large parts of that review would probably be "profanity" or "foul language".

The way to deal with the above is a mix of structured and unstructured data, with some kind of relation between the two. Like a column of unstructured data in each relational database table (or should I say "relation", just to show my age? Or maybe I should pretend to be really young, modern and cool, possibly sporting a hipster beard and all that, by calling it a "collection").

With that out of the way, let's consider an example using structured as well as non-structured JSON data. Assume we have a store selling different types of clothing - pants, jackets, shoes - and we are to create a table to hold the inventory. This table would have some columns that are always there and which have the same meaning for all rows in the table, like name, amount in stock and price. These are items that are well suited for a relational format. On top of this we want to add attributes that have a different meaning for each type of, or even each instance of, item. Here we have things like colour, width, length and size. These we consider non-relational, as the interpretation of these attributes is different depending on the type of garment (like size M, or shoe sizes, or a "zebra striped" colour), and some garments might have some unique attribute, like designer or recommended by staff or something. Our table might then look something like this:
MariaDB> CREATE TABLE products(id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  price DECIMAL(9,2) NOT NULL,
  stock INTEGER NOT NULL,
  attr VARCHAR(1024));

In this table we have a few columns that look like columns in any relational database table, and then we have a column, called attr, that can hold any relevant attribute for the garment in question, and we will store that as a JSON string. You probably noticed that we aren't using a JSON datatype here, as that is not present in MariaDB; there are JSON functions, but those act on a text string with JSON content. These functions are introduced in MariaDB 10.2 (which is in Beta as I write this), but there are a few bugs that mean you should use MariaDB 10.2.4 or higher, so from here on I assume that MariaDB 10.2.4 or higher is being used.

But there is one issue with the above that I don't particularly care for: as the attr column is plain text, any kind of data can be put in it, even non-valid JSON. The good thing is that there is a fix for this in MariaDB 10.2, namely CHECK constraints that actually work, a little-discussed feature of MariaDB 10.2. The way this works is that whenever a row is INSERTed or UPDATEd, any CHECK constraint runs and validates the data, and if the validation fails the operation also fails. Before I show an example I just want to mention one JSON function we are to use here, JSON_VALID, which takes a string and checks if it is valid JSON. Note that although CHECK constraints are particularly useful here, they can be used for any kind of data validation, as the sketch below shows.
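For example, here is a minimal sketch of a non-JSON CHECK constraint, on a hypothetical price_list table; an INSERT or UPDATE that sets a negative price would fail the same way the JSON examples below do:
MariaDB> CREATE TABLE price_list(item_id INTEGER NOT NULL PRIMARY KEY,
  price DECIMAL(9,2) NOT NULL,
  CHECK (price >= 0));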

Armed with this, let's rewrite the statement that creates the table like this:
MariaDB> CREATE TABLE products(id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  price DECIMAL(9,2) NOT NULL,
  stock INTEGER NOT NULL,
  attr VARCHAR(1024),
  CHECK (JSON_VALID(attr)));


Let's give this a try now:
MariaDB> INSERT INTO products VALUES(NULL, 'Jeans', 10.5, 165, NULL);
ERROR 4025 (23000): CONSTRAINT `CONSTRAINT_1` failed for `inventory`.`products`

Ok, that didn't work out. What happens here is that a NULL string isn't a valid JSON value, so we need to rewrite our table definition:
MariaDB> CREATE TABLE products(id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  price DECIMAL(9,2) NOT NULL,
  stock INTEGER NOT NULL,
  attr VARCHAR(1024),
  CHECK (attr IS NULL OR JSON_VALID(attr)));

Following this we can try it again:
MariaDB> INSERT INTO products VALUES(NULL, 'Jeans', 10.5, 165, NULL);
Query OK, 1 row affected (0.01 sec)
MariaDB> INSERT INTO products VALUES(NULL, 'Shirt', 10.5, 78, '{"size": 42, "colour": "white"}');
Query OK, 1 row affected (0.01 sec)
MariaDB> INSERT INTO products VALUES(NULL, 'Blouse', 17, 15, '{"colour": "white}');
ERROR 4025 (23000): CONSTRAINT `CONSTRAINT_1` failed for `inventory`.`products`

That last statement failed because of malformed JSON (a double quote was forgotten about), so let's correct that:
MariaDB> INSERT INTO products VALUES(NULL, 'Blouse', 17, 15, '{"colour": "white"}');
Query OK, 1 row affected (0.01 sec)

One thing that has yet to be discussed is indexes on JSON values. As the attr column in our example is plain text, we can of course index it as usual, but that is probably not what you want; rather, what would be neat would be an index on individual attributes in that JSON string. MariaDB doesn't yet support functional indexes, i.e. indexes not on stored values but on computed values. What MariaDB does have, though, is Virtual Columns, and these can be indexed, and as of MariaDB 10.2 these virtual columns don't have to be persistent (read more on Virtual Columns here: Putting Virtual Columns to good use).

The easiest way to explain this is with an example. Let's say we want an index on the colour attribute, if such a thing exists. For this we need two things: a virtual column that contains the colour attribute as extracted from the attr column, and then an index on that column. In this case we will be using the JSON_VALUE function, which takes a JSON value and a path, the latter describing which part of the JSON document to extract, somewhat like a query language for JSON.
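To get a feel for the path syntax, here is a minimal standalone sketch; '$' refers to the top-level JSON document and '$.colour' picks out its colour member:
MariaDB> SELECT JSON_VALUE('{"size": 42, "colour": "white"}', '$.colour');
This should simply return the string 'white'.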

We end up with something like this:
MariaDB> ALTER TABLE products ADD attr_colour VARCHAR(32) AS (JSON_VALUE(attr, '$.colour'));
MariaDB> CREATE INDEX products_attr_colour_ix ON products(attr_colour);

With that in place, let's see how that works:
MariaDB> SELECT * FROM products WHERE attr_colour = 'white';
+----+--------+-------+-------+---------------------------------+-------------+
| id | name   | price | stock | attr                            | attr_colour |
+----+--------+-------+-------+---------------------------------+-------------+
|  2 | Shirt  | 10.50 |    78 | {"size": 42, "colour": "white"} | white       |
|  3 | Blouse | 17.00 |    15 | {"colour": "white"}             | white       |
+----+--------+-------+-------+---------------------------------+-------------+
2 rows in set (0.00 sec)

And let's see if that index is working as it should:
MariaDB> EXPLAIN SELECT * FROM products WHERE attr_colour = 'white';
+------+-------------+----------+------+-------------------------+-------------------------+---------+-------+------+------------+
| id   | select_type | table    | type | possible_keys           | key                     | key_len | ref   | rows | Extra      |
+------+-------------+----------+------+-------------------------+-------------------------+---------+-------+------+------------+
|    1 | SIMPLE      | products | ref  | products_attr_colour_ix | products_attr_colour_ix | 99      | const |    2 | Using where|
+------+-------------+----------+------+-------------------------+-------------------------+---------+-------+------+------------+
1 row in set (0.00 sec)


And just to show that the column attr_colour is a computed column that depends on the attr column, let's try updating the colour for the blouse, making it red instead of white, and then search for that. To replace a value in a JSON object, MariaDB 10.2 provides the JSON_REPLACE function (for all JSON functions in MariaDB 10.2, see the MariaDB Knowledge Base).
MariaDB> UPDATE products SET attr = JSON_REPLACE(attr, '$.colour', 'red') WHERE name = 'Blouse';
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0
MariaDB> SELECT attr_colour FROM products WHERE name = 'blouse';
+-------------+
| attr_colour |
+-------------+
| red         |
+-------------+
1 row in set (0.00 sec)


There is more to say about JSON in MariaDB 10.2, but I hope you now have a feel for what's on offer.
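As one last example of what is there, the JSON functions can also go the other way and build JSON from relational data. Here is a minimal sketch using the JSON_OBJECT function on the products table from above:
MariaDB> SELECT JSON_OBJECT('name', name, 'stock', stock) FROM products WHERE name = 'Shirt';
This should return something like {"name": "Shirt", "stock": 78}.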

Happy SQL'ing
/Karlsson

Monday, January 16, 2017

Released MyMSSQLDump 1.1

My program for exporting data from MSSQL and Sybase into a whole bunch of other formats, including:
  • JSON
  • HTML
  • CSV
  • MySQL (mysqldump style)
  • MSSQL / Sybase INSERT-style
  • Oracle INSERT-style
is now released as version 1.1. There is a whole bunch of new things, most notably the Oracle-style export format, but also:
  • DATETIME datatype formatting
  • DATETIMEOFFSET formatting
  • Other temporal datatype support
  • Much more flexible formatting in general
  • More tests
As usual this is GPL v2 licensed and is available for download from sourceforge.

Happy SQL'ing
/Karlsson

Thursday, January 5, 2017

Powerful MariaDB exports using MyMDBDump

You can export data from MariaDB using mysqldump and a bunch of other tools, but if you need a really flexible output format, these might not be what you need. Instead, give MyMDBDump a try. This tool will export data just like mysqldump, but the output format is a lot more flexible. I urge you to test it and check out the documentation, but some of the features are, in short:
  • Dynamic column support - Dynamic columns can be exported as binary, JSON or even as SQL, where I think the latter is a pretty unique feature.
  • Oracle export format - This exports data as INSERT statements, just like mysqldump, but in an Oracle friendly way.
  • MS SQL / SQL Server format - This exports data as SQL Server friendly INSERTs.
  • Binary data formats - Supporting plain binary, base64, hex and a number of other formats.
  • Transaction support.
  • JSON export support - Including embedding dynamic columns as JSON.
  • UTF8 support and UTF8 validity checking.
  • Generated ROWNUM column output
  • FLOAT and DOUBLE formatting
  • DATETIME, TIMESTAMP, DATE and TIME custom formatting.
And a lot more. Download from sourceforge.

Happy SQL'ing
/Karlsson