Thursday, May 31, 2012

An introduction to JSON

JSON is a data format that has become very popular recently. The name stands for "JavaScript Object Notation", which is a bit misleading: although the format originated in JavaScript, it has become very popular in many more environments.

JSON is used as a format for both queries and results in database applications, as well as for data export and import. Among the most JSON-centric database systems, and probably also the most JavaScript-focused one, is MongoDB. In the NoSQL world, JSON is used extensively, as it is really well suited to representing documents, and for a variety of other reasons too. JSON is also a textual format that is easy to read, and it is pretty well standardized.

Drizzle has some features for working with JSON, and this is a pretty good example of something that is becoming really common: REST + JSON, that is, a standard HTTP call using JSON for input and output. There are a whole bunch of implementations of this, so you can, for example, use curl to query Drizzle.

The best reference for JSON is probably the json.org website, but I'll give a brief introduction here.

To begin with, the basic datatypes in JSON (as well as in JavaScript) are string, number, boolean and null. There is nothing else. Also, strings are UTF-8: there is no Latin-1 or anything like that to handle, either it's UTF-8 or it's not JSON, although you may put any character in a string if it is properly escaped. A number is just that, a string of digits, optionally prefixed by a - (minus) sign and optionally with a decimal point. A number may also optionally use exponential notation.
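
Just to make this concrete, here is a single made-up JSON value showing all the scalar types, including exponential notation, wrapped in the array notation that is explained just below:

["some text", 42, -1.5, 6.02e23, true, false, null]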

Now, in JSON, everything is an object or an array. Every object and array may contain embedded objects and arrays, so nesting may go arbitrarily deep. Every attribute is named, and the name is specified in double quotes, followed by a colon and then the value. The attributes are separated by a comma and one or more optional spaces. If the value is a string, it is enclosed in double quotes.

An array is enclosed within brackets, and an object within curly braces. So with all this theory, let's look at an example, where I represent a set of users:

{"id": 1, "fname": "Joe", "lname": "Bloggs", "groups": ["admin", "guest"]}
{"fname": "Ann", "lname": "Bloggs", "groups": ["users"], "address": {"street": "Main Street 1", "city": "San Jose"}, "id": 2}
{"id": 3, "fname": "John", "lname": "Doe", "groups": ["admin"], "address": {"city": "New York"}, "active": false}

This wasn't too bad, right? Nothing terribly complex in there. These were 3 JSON records, and this is what a typical JSON formatted file might look like, for example a dump from a MongoDB database. There is one more thing to note here: the order of the attributes is undefined!
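
As a quick illustration, these two strings parse to equivalent objects in any JavaScript shell (the mongo shell included):

var a = JSON.parse('{"id": 1, "fname": "Joe"}');
var b = JSON.parse('{"fname": "Joe", "id": 1}');
a.id === b.id && a.fname === b.fname;   // true: same data, regardless of attribute order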

One thing to watch out for here is numbers, and how they work in MongoDB and JavaScript. The JavaScript number datatype, which again is standardized, is not really exact when it comes to big numbers. More specifically, a JavaScript number is an 8-byte floating point value, which in turn means that an integer with more than some 15 digits will not be precise. This is an issue when we use integers as identifiers with large amounts of data, or when the integer is some kind of bitmap. Now, JSON itself doesn't have this issue, but when transferring data from, say, MySQL to MongoDB, we need to consider it.

So how does this work with MongoDB then, where we have large integers as identifiers? Well, MongoDB uses the open source SpiderMonkey JavaScript engine from Mozilla, the one you find in products such as Firefox. To work around the issue with large integers, MongoDB adds a datatype of its own, and some functions to manage it (this is why you sometimes see NumberLong("some long number") in MongoDB). When we have a number in MySQL that we want to import into MongoDB, and that number is an integer with 16 digits or more, then we must ensure we use the MongoDB-specific datatype. A MongoDB integer and a JavaScript native integer may be operated on as equals in many cases, but with large integers, the result might not be what you expect:
db.mytable.find({id: 15617798765916278})
Might not work as expected, but this might:
db.mytable.find({id: NumberLong(15617798765916278)})
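
The underlying double-precision limit is easy to demonstrate in any JavaScript engine, the mongo shell included (the exact values here are just illustrative):

9007199254740992 + 1 === 9007199254740992   // true! 2^53 + 1 cannot be represented as a double
NumberLong("15617798765916278")             // mongo shell 64-bit integer; passing the value as a
                                            // string keeps it from ever going through a double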

I will present the first version of my JSON-to-MySQL loader (MongoDB focused) real soon, so don't touch that dial!

/Karlsson

Wednesday, May 30, 2012

Some JSON please

One thing still missing from the basic MySQL stuff is some more JSON-oriented features. After some use of JSON (here at Recorded Future, JSON is THE data format), I have begun to accept this format, and have even begun to appreciate it. JSON really is a useful, flexible format that is even reasonably well standardized and has support in most languages.

CSV used to be the obvious data exchange format for RDBMSes, but in the case of NoSQL, I sense that JSON is beginning to take that role, and not only in obviously JavaScript-focused NoSQL databases such as Mongo. JSON is more object-focused than CSV, is more standardized, has some better features and is in general more flexible.

Of course, a format with nested objects and arrays such as JSON would be hard to import into MySQL in the general case, at least if you want to expand arrays into lookup tables, foreign key constructs and so on. But export should be reasonably simple, and import of simple JSON structures should not be too hard.

While thinking about this, I realized that maybe I could be the one to create some JSON / MySQL import and export tools. Many of the solutions I have seen so far were based on some scripting language or MySQL Stored Routines (which doesn't mean other tools aren't out there, I haven't researched it that much), so I wanted something else. Why? Because NoSQL often means a lot of JSON structured data, so good performance is really important, which in turn means loading data using INSERT arrays and multiple threads. Export means some means of multi-threading and possibly some additional performance-enhancing methods.

I would really like some input here. I have slowly started on a simple JSON loading tool for MySQL. Right now, this tool can only load flat structures, and embedded arrays or objects get loaded as text. But it is multithreaded and it does load using INSERT arrays. So what other functions would you like? Data mapping? Breaking up an array into foreign keys? On one level or even on multiple levels?
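
To show what I mean by INSERT arrays, here is a minimal JavaScript sketch of turning a batch of flat JSON documents into one multi-row INSERT statement. This is just an illustration of the idea, not the actual tool: a real loader also has to deal with proper escaping, data types and the multi-threading part.

// Build a single multi-row INSERT from an array of flat JSON documents.
// Assumes every document in the batch has the same flat structure.
function jsonToInsert(table, docs) {
  var columns = Object.keys(docs[0]);
  var rows = docs.map(function (doc) {
    var values = columns.map(function (col) {
      var val = doc[col];
      if (val === null || val === undefined) return "NULL";
      if (typeof val === "number" || typeof val === "boolean") return String(val);
      if (typeof val === "object") val = JSON.stringify(val); // embedded arrays/objects go in as text
      return "'" + String(val).replace(/'/g, "''") + "'";     // naive quoting, good enough for a sketch
    });
    return "(" + values.join(", ") + ")";
  });
  return "INSERT INTO " + table + " (" + columns.join(", ") + ") VALUES " + rows.join(", ");
}

The point is the shape of the output: one INSERT with one VALUES group per document, which is a lot friendlier to MySQL than one round trip per row.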

In conclusion, I like JSON. I think it's a useful data format with good standardization and some modern features, and it is well beyond, say, CSV (which isn't standardized at all and has quite a few format variations, for no good reason).

/Karlsson

Thursday, May 24, 2012

On simplicity, awk and potatoes

Yes, things certainly change all the time for us IT folks. New classes of hardware, new cool pieces of software and cool gadgets. Stuff that usually gets better and better, and if not, at least it is on the move. Constantly.

This year's (2012) new potatoes. (c) God
If it isn't Ivy Bridge-based motherboards, it's the "New iPad" (that is a strange name by the way, what is wrong with iPad 3? And it will not be "new" for long) or MySQL 5.6 or Ubuntu 11.10.

And then there are things that don't improve much over time, and still stay around. Sometimes because they have some powerful backers, despite being pure evil (Adobe Flash, anyone?), but sometimes because it's so good that you just cannot improve on the original. Awk truly was right from the start; few improvements have been made since I first used it some 30 years ago, and it's still the same helpful tool (sure, some things have changed, but the basic concept is the same, and so is most of the common syntax).


And then we have stuff that, unlike awk, isn't developing at all. No new releases, no features added, no cool beta versions, just the same plain old thing, year after year, and that's how we like it to be. For example (you MySQLers reading this already know what this is about): Swedish new potatoes. I had my first ones of the year today. Oh my god, this is so good! With a pinch of salt and some butter, this is better than candy!

/Karlsson

A tale of a benchmark attempt - Attempt 1

Whoa, it's been a long time since I posted here! But I have been very busy with non-MySQL related issues lately, so that is an explanation, I guess.

This week I decided to try a few MySQL things anyway. The plan was to compare MongoDB with MySQL Cluster as a key-value store. We have some data here at Recorded Future that is currently in MongoDB; it will not fit in DynamoDB (it relies on secondary indexes, for example), and I was thinking that maybe MySQL Cluster was an alternative. It has been some time since I last tried Cluster anyway.

At Recorded Future, we run everything on Amazon EC2, but I was thinking that this benchmark should be about more than just comparing MySQL Cluster with MongoDB: I also wanted to see the difference between EC2 and some real hardware.

So, I started downloading some data to my recently constructed Linux server box. This server is a home-brew machine housed in a Chieftech 4U rackmount box. There is an Asus M5A88V EVO mobo in it, and on that there is an 8-core AMD CPU and 16 GB of RAM, so for a box at home for experimental purposes, it's quite reasonable.

What is not reasonable is how Linux folks treat hardware and user requirements sometimes. I understand that the developers of Linux do this, to a not small extent, in their free time. Also, I understand that stuff sometimes goes wrong. But hey, Ubuntu 10.10 (which we use, despite it being old) is a pretty common Linux distro. On my mobo there is a Gigabit LAN chip, as on more or less all mobos these days. One of the most common LAN chipsets is from Realtek, either the 8111 or the 8168. Seems insignificant, right? No big deal? Just install Linux and it works; Linux may have issues with very unusual hardware, but not with something as common as the Realtek 8111/8168? Beeep! Wrong answer! Sure, it works, but slowly. If you look carefully, you realize that network performance is dead slow, and further investigation shows that this is due to a lot of dropped packets.

Doing an lsmod, you can see that Linux (this is Ubuntu 10.10, but this is not uncommon on other Linuxes either) has determined that it wants to use the driver for the much less common Realtek 8169 Gigabit Ethernet chip. These chips are seemingly compatible, and hey, it works, but it doesn't work well. Back to the drawing board, in this case: download the module source for the 8111/8168 from Realtek, build a new Linux module, remove and blacklist the r8169 module, and then install the r8168 module instead. Yes, I can live with this. But those who are not developers or administrators and want to use Linux will have issues with this. Look, OSS folks, you are doing a good job, but look at your packaging and how you address users.

That said, it was back to my supposedly 16 GB Linux box, which Linux thought had only 3.2 GB available. Again, this is a Linux kernel issue. And again it's not that uncommon: it affects certain AMD mobos with a certain AMD chipset. Again, the patch from AMD for this is simple, but it does require patching the Linux kernel. I would expect stuff to work better than this and to be better tested, but on the other hand, my mobo is pretty new and Ubuntu 10.10 is pretty old, so OK. But I have far fewer hardware-related issues with my Windows machines. And before you reply that those guys are paid, I understand that, but I was hoping the power of the Linux community and the better way of developing software that OSS represents would compensate for that. But that is not so, it seems, so I guess Linux stays largely among us geeks for a while, which might be just as well, as that is how I make my money!

Oh, and what happened to the benchmark? Well, instead of benchmarking I have been busy upgrading, downgrading, rebuilding and patching Linux, so it never happened. Now I do have a server where Linux can see all 16 GB of memory and where the network seems to work OK (I have to admit it, Realtek sucks. I have been trying to find an alternative, but most PCI boards also have a Realtek chip on them).

But stay tuned: once my box is again properly VPNed to the Recorded Future network, I'll install MongoDB again, reimport the data into it, then convert and load the data into MySQL Cluster, and then I am ready for some simple testing. But this is taking WAY longer than I expected!

/Karlsson