Wednesday, June 24, 2009

Editing SQL for MySQL with SciTe / Scintilla

This is just a brief announcement that the popular Open Source text editor control Scintilla, and the SciTe editor, which is built on Scintilla and is part of the same project, these days properly support MySQL syntax.

Scintilla has supported generic SQL for quite a while, but that support was heavily influenced by Oracle, and, for example, MySQL procedure code and folding didn't work properly. This has now been fixed to an extent, and there is a MySQL specific lexer that I contributed. The lexer has been part of my own MyQuery tool since version 2.0, which shipped with a special Scintilla build with the MySQL lexer patched in, but the patch is now included in standard Scintilla and SciTe.

And just so you know, although my MyQuery tool is Windows only (yes, this is still the case), Scintilla and SciTe are not: they support Linux, GTK, Windows etc. So you can always use SciTe if you are on Linux, and get proper MySQL code folding etc.

/Karlsson
Hacking away at MyQuery

Monday, June 22, 2009

MyQuery 2.4 - Of multithreading, messages and Windows

Warning! This is not much database related, except that I am writing it while developing my Database Query Tool MyQuery 2.4 right now.

While working on MyQuery 2.4, which should be ready for beta real soon now, I have gotten myself into an issue with Windows and multithreading. Actually, this issue exists to some extent in most message based systems, I guess, assuming they also support multithreading.

If you have not developed with Win16/Win32 or some other message based GUI system, then let me fill you in on how Windows does this. This, to a large extent, dates back to when Windows wasn't truly multi-threaded at all (such as Win 3.0, Win 3.11 etc.). Even in those old versions of Windows, you could switch from application to application, window to window anyway. And the application didn't seem to be blocked (I am ignoring any general performance issues with those old versions of Windows here).

How did that work? It worked because applications weren't just running away handling all keyboard and mouse input on their own, and weren't blocking when no action was needed. Instead, an application has a "message loop" that handles messages. If a message is to be handled, it is handled, and then control is handed back to Windows itself, until there is another message meant for yours truly. OK, fine, this is a cheap way of "multithreading" which is also reasonably lightweight and works quite well.
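For those who have never seen one, this is roughly what sits at the bottom of a typical WinMain(): the classic message loop (a minimal sketch, window class registration and window creation omitted):

    #include <windows.h>

    /* ... window class registration and CreateWindow() omitted ... */

    int run_message_loop(void)
    {
        MSG msg;

        /* Wait for a message, handle it, then hand control back to Windows. */
        while (GetMessage(&msg, NULL, 0, 0) > 0)
        {
            TranslateMessage(&msg);  /* cook keyboard input (WM_CHAR etc.)        */
            DispatchMessage(&msg);   /* route the message to the window's WndProc */
        }
        return (int)msg.wParam;      /* the wParam of WM_QUIT is the exit code    */
    }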

What is not so good is that we, at times, need to deal with blocking operations. As long as I do not return from handling a message, the application is now blocked. And this goes for ALL parts of the application, GUI included! No repaints, nothing, dead. And there are some blocking operations we need to deal with here.

And not only that. Let's say we have a database connection that is shared by several objects, active at the same time, in your application. In my case, I have a bunch of modeless dialogs showing database data that share one connection. Usually, you would think that just putting a mutex on that connection would be enough: one window does its thing (there is a sketch of this naive pattern in code after the list):
  • Wait for the Mutex until I have it.
  • While I have the Mutex, do my database things.
  • Release the Mutex when the database thing is done.
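In Win32 terms, that naive pattern is something like this (hConnMutex and do_database_work() are made-up names for illustration; this is exactly the version that can freeze the GUI):

    #include <windows.h>

    extern HANDLE hConnMutex;        /* protects the shared connection        */
    void do_database_work(void);     /* hypothetical: the actual database code */

    void run_query_blocking(void)
    {
        WaitForSingleObject(hConnMutex, INFINITE);  /* blocks, message loop and all */
        do_database_work();                         /* use the shared connection    */
        ReleaseMutex(hConnMutex);
    }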
Now, I can end up in the same situation as when I have multiple interlocking mutexes: a deadlock. But this time the situation is worse, as the waiters aren't from the same group (i.e. if I have several mutexes deadlocking, the mutex system can potentially figure this out). The Windows messages are stuck, as my modeless dialog is stuck waiting for a mutex, and the other modeless dialog is stuck, as it is waiting for a message whilst holding the mutex.

In Windows, a modeless dialog by default uses the application's main message loop. If I remember correctly, MFC (which I do not like, and I am not alone in this it seems) handles this by using a smarter than usual message loop. I could have done that in my case, I guess, but I decided on a different approach:
  • Send an application defined message to do the database processing.
In the processing of this message:
  • Get the Mutex, but do not wait for it.
  • If I got it, do the database processing, release the Mutex and reset the flag saying that I am waiting for the Mutex.
  • If I didn't get it, set a flag that I am waiting to get it.
For any message that I am not processing (I could probably pick some useful idle message here also):
  • If the flag says that I am waiting for the Mutex, then send the user defined message to myself (there is a rough sketch of this in code below).
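Roughly, in code, the scheme looks like this (WM_APP_DO_DBWORK, g_waiting_for_mutex and do_database_work() are names made up for the sketch; the Win32 calls themselves are the real ones):

    #include <windows.h>

    #define WM_APP_DO_DBWORK (WM_APP + 1)   /* the application defined message   */

    extern HANDLE hConnMutex;               /* the shared connection mutex       */
    static BOOL g_waiting_for_mutex = FALSE;
    void do_database_work(void);            /* hypothetical: the actual DB code  */

    INT_PTR CALLBACK QueryDlgProc(HWND hDlg, UINT msg, WPARAM wParam, LPARAM lParam)
    {
        switch (msg)
        {
        case WM_APP_DO_DBWORK:
            /* Try to get the mutex, but do not wait for it. */
            if (WaitForSingleObject(hConnMutex, 0) == WAIT_OBJECT_0)
            {
                g_waiting_for_mutex = FALSE;
                do_database_work();
                ReleaseMutex(hConnMutex);
            }
            else
                g_waiting_for_mutex = TRUE;  /* remember to poll again */
            return TRUE;

        default:
            /* Any message we don't otherwise handle: if we are still waiting,
               post the request to ourselves again. */
            if (g_waiting_for_mutex)
                PostMessage(hDlg, WM_APP_DO_DBWORK, 0, 0);
            break;
        }
        return FALSE;
    }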
As we can see, I am avoiding deadlocks here by converting blocked operations to messages, making everything that blocks into a polling process for the resource in question, using a message for each poll. One thing I cannot do here, though, is send any messages to myself whilst I am processing the database stuff, at least not any that can potentially block (other messages will nicely queue up of course).

Is this a good and accepted way of doing things? Maybe; at least it seems to work. I can have a script running in one window, and another window using the same database connection, all protected by a mutex, but none of the windows blocking. Next time around, I will probably figure out a way to massage the message loop so I can do this smarter, but for now, this works.

And this is an example of how locking always comes back at you, in particular when multiple locking systems interact (like a GUI, a threading system and a database; or a database that in one place locks the table, and in another place locks a row).

/Karlsson
Back to MyQuery. And I will not blog more about it until 2.4 is done. I promise...

Friday, June 19, 2009

Working on MyQuery 2.4, issues with the C API (Opinionated)

I am currently working on a 2.4 version of MyQuery. I have included a few cool features in it, and I have also made some significant changes to the code.
Above all, all database access functions have been put in one single file, with a set of functions that allow me to add other types of connections later. In the process, I have figured out a few things about the MySQL C API that I am not so hot on.
To take one example, I'm not too keen on the naming convention for fields and columns. In the naming of functions in the API, someone seems to think that a column and a field are the same thing. They are not.
This is how many people look at things, including myself, when we speak of relational data:
  • A row is one of 0 or more rows in a database.
  • A column is one of 1 or more columns in a dataset.
  • A field is the "crossing" of a row and a column.
Now, you are welcome to disagree with this (it is midsummer after all, so I am quite relaxed), but whatever the semantics, using column and field interchangeably is not really appropriate, and I think you agree. And even if you don't, and we were to say that in some terminology they are the same, then just one of the terms should be used. Not so in the MySQL C API docs:
mysql_field_count - Returns the number of result columns for the most recent statement

Also, why are there both a mysql_field_count and a mysql_num_fields function? Yes, I know the use of both of them, but the former seems rather useless. Or rather, from a high-level view, design-wise, it is useless. Although it does have a use, that is mainly because of other weird aspects of the C API.
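To be fair, this is the documented pattern where mysql_field_count() earns its keep: after mysql_store_result() returns NULL there is no result set to hand to mysql_num_fields(), so the column count has to come from the connection handle instead (a minimal sketch, assuming an already connected handle):

    #include <mysql.h>
    #include <stdio.h>

    void handle_result(MYSQL *mysql)
    {
        MYSQL_RES *result = mysql_store_result(mysql);

        if (result != NULL)
        {
            unsigned int ncols = mysql_num_fields(result);  /* needs a result set */
            printf("%u columns returned\n", ncols);
            mysql_free_result(result);
        }
        else if (mysql_field_count(mysql) == 0)
        {
            /* The statement returns no rows (INSERT, UPDATE, ...). */
            printf("%llu rows affected\n",
                   (unsigned long long)mysql_affected_rows(mysql));
        }
        else
        {
            /* A result set was expected but could not be retrieved: an error. */
            fprintf(stderr, "error: %s\n", mysql_error(mysql));
        }
    }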

If my SQL returns just 1 result, then I get just one result. If my SQL returns more than 1 result, I get several results back, the last one being empty? I don't see why this is necessary. Or rather, again, I read the documentation and I realize why the protocol needs to send an extra result set, but why does that have to be exposed to the user of the API? Come on, if I can get rid of it in my own abstraction of the API, it can be gotten rid of in the API itself. Right?
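For reference, this is roughly the loop the API makes you write when a statement (a CALL, for instance) may return more than one result set, including that final, row-less one (a sketch; it assumes the connection was made with multi-statement/multi-result support enabled):

    #include <mysql.h>

    void drain_all_results(MYSQL *mysql)
    {
        do
        {
            MYSQL_RES *result = mysql_store_result(mysql);
            if (result != NULL)
            {
                /* ... process the rows ... */
                mysql_free_result(result);
            }
            /* else: a status-only "result" with no rows, or an error. */
        } while (mysql_next_result(mysql) == 0);  /* 0 means another result exists */
    }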

And all this said, with the work in progress on a better implementation of the C API, in the shape of the C Connector (Thanx Jim), there is better stuff coming here. Hey, we need a better protocol, and a better C API!

/Karlsson

Wednesday, June 17, 2009

Of Oracle API's, array interface and bind variables

As we know, the MySQL prepared statements API leaves a bit to be desired, although there is a fair amount of progress. With MySQL, whether you use the "normal" API or the prepared statement API usually doesn't make that much difference in terms of performance.
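For those who have not used it, a single prepared INSERT through the MySQL C API looks roughly like this (a minimal sketch; the table t1(c1 INT) and the connected handle are assumptions made for the example):

    #include <mysql.h>
    #include <string.h>

    int insert_one(MYSQL *mysql, int value)
    {
        MYSQL_STMT *stmt = mysql_stmt_init(mysql);
        MYSQL_BIND  bind[1];
        const char *sql = "INSERT INTO t1(c1) VALUES(?)";

        if (stmt == NULL)
            return -1;

        memset(bind, 0, sizeof(bind));
        bind[0].buffer_type = MYSQL_TYPE_LONG;   /* a C int */
        bind[0].buffer      = (char *)&value;

        if (mysql_stmt_prepare(stmt, sql, strlen(sql)) != 0
            || mysql_stmt_bind_param(stmt, bind) != 0
            || mysql_stmt_execute(stmt) != 0)
        {
            mysql_stmt_close(stmt);
            return -1;
        }
        mysql_stmt_close(stmt);
        return 0;
    }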

When you use Oracle though, things are different. There is no separate "prepared statement" API, there is just one interface, which has a lot of functions and structures, and isn't the easiest to use because of the complexity, but it IS very functional and performant.

If we look at the low-level Oracle interface, there used to be three of them:
  • Oracle Call Interface (OCI) - This is the interface that is the most complex, but also the most functional. It is a C level interface along the lines of the MySQL C API, i.e. you still pass SQL statements; it's not a file level interface of some kind. This is also the interface that is used to build Oracle Data Cartridges (i.e. the "real" Cartridges that plug into the database and are called like stored routines).
  • Oracle PRO* - These are the Oracle Embedded SQL interfaces. And if you think that the Oracle Embedded SQL precompilers produce OCI code, then think again: the precompilers use a different library. But they may be used together, with some work.
  • Oracle SQL*Plus API - As far as I know, SQL*Plus (which is the Oracle interactive command line query tool) used a different interface from the other two. This was never confirmed though, so I might be wrong here, and it might have changed.
Now, using either of the two public API's above, OCI or PRO* (C, Fortran etc.), you nearly always use what would be prepared statements in MySQL. Why? Because Oracle is pretty slow at parsing and optimizing statements, and also because when you work with Oracle, you tend to write rather complex SQL (as the SQL syntax is a bit richer, and because this is what a macho Oracle developer is supposed to do: write complex SQL), which just adds to the time it takes to parse and optimize. On the other hand, once a statement is parsed, reusing it is blindingly fast in comparison. So in the case of Oracle, prepared statements are used ALL THE TIME; frankly, you often see prepared statements with bind variables being used even if you are to execute a statement just once, and this is sometimes faster than just running a single statement with the data being part of the statement, the way we mostly do it with MySQL.

So, what is the Array Interface then? This is an extension to PRO*, also available with OCI (everything is available in OCI), that allows you to do what Ronald Bradford described in his blog, i.e. inserting multiple rows with one statement. But the array interface does this differently: the INSERT statement looks the same, the difference is on the API side of things. What you do is parse your statement as:
INSERT INTO t1(c1) VALUES(?)
But when you bind the ? parameter, you bind it to an array of integers, and when you execute the statement, you tell Oracle how many array records to process. This is very fast; the speedup, compared to executing N separate INSERT statements, is at least as big as it is with MySQL's array interface, actually higher (as Oracle is typically not that good at executing many small statements).
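In OCI terms, the scheme looks roughly like the sketch below (note that OCI writes the placeholder as :1 rather than ?; handle allocation, OCIStmtPrepare() and error checking are omitted, so this is a fragment under those assumptions, not a complete program):

    #include <oci.h>

    #define N_ROWS 1000

    /* Assumes stmtp holds the prepared statement
       "INSERT INTO t1(c1) VALUES(:1)". */
    void insert_array(OCISvcCtx *svchp, OCIStmt *stmtp, OCIError *errhp)
    {
        int      values[N_ROWS];
        OCIBind *bindp = NULL;
        ub4      i;

        for (i = 0; i < N_ROWS; i++)
            values[i] = (int)i;

        /* Bind position 1 to the whole C array, one element per row. */
        OCIBindByPos(stmtp, &bindp, errhp, 1,
                     (void *)values, sizeof(values[0]), SQLT_INT,
                     NULL, NULL, NULL, 0, NULL, OCI_DEFAULT);

        /* iters = N_ROWS: execute the statement once per array element,
           i.e. insert N_ROWS rows in a single round trip. */
        OCIStmtExecute(svchp, stmtp, errhp, N_ROWS, 0,
                       NULL, NULL, OCI_DEFAULT);
    }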

Which is the better way of dealing with arrays then? If you ask me, I have a slight preference for Oracle's, and let me tell you why: with MySQL, arrays are only available in this shape and form for INSERT statements, for obvious reasons. With Oracle, the same array interface is applicable also to UPDATE, DELETE and (don't hold your breath) SELECT! Yes, you can allocate an array of N rows, and then tell Oracle to fetch N rows at a time; again, this is way faster than fetching row-by-row. (On the other hand, the way data is fetched is very different between MySQL and Oracle. MySQL will always materialize a dataset somewhere (now I am talking API datasets; I know there are many materializations going on inside MySQL also, before we reach the protocol), whereas Oracle will typically not do this, at least not on the client side of things. Whether this is good or bad is a different issue.)

Another advantage, if you ask me, of the Oracle way of doing things is that the SQL statement itself stays intact. The disadvantages are at least two:
  • You have to use bind variables.
  • The API is a bit complex.
So, all in all, I like the idea that MySQL has some kind of array interface. The optimal would be to have an API array interface AND an SQL based (like MySQL's) array interface. But maybe I'm just dreaming.

/Karlsson