March 3, 2013

Let’s do some PLAIN SQL text compression…! (Part I)



OK – It’s fun time again...! :)

The challenge I got for you now is: Let’s do some plain SQL text compression (in CUBRID, of course..)!

What does this challenge mean exactly…? Let’s imagine that:

  • You have a table which contains plain, “normal” English text, let’s say the daily employee status reports
  • You don’t have an issue with the database response speed, but you have an issue with the big data(base) size
  • The solution you came up with is obviously to compress its content, using some text compression algorithm
  • You can’t use stored procedures! (as they are not standard plain SQL)
  • You must achieve compression only by using SQL statements and tables and whatever else a “standard” DBMS offers!
  • You can use as many SQL statements as you want!
  • Once again – you can use only standard SQL statements (including any well-known functions and data manipulation commands)

(Yes, I know! – The scenario is not quite realistic – but please remember the fun part… …and that you will improve your SQL skills as well… :D)

Hmm – so what do you say…? Are you up for this challenge…? (And you can use whatever DBMS you want for this - CUBRID is just my obvious choice because this blog is about CUBRID…)
If you are up to the challenge, then take 10 min and think about it… After that, you should read the solution below…
When I started analyzing the problem, here is what crossed my mind:

  • It can’t be a complicated compression algorithm, as we are limited to just some SQL statements and the standard SQL functions
  • It must be a lossless compression algorithm

After doing some research, I realized that one way to go is to implement a “most common English words” replacement algorithm, which works like this:

  • Get a list of the N most commonly used English words
  • Figure out a way to replace each of the words with some “shorter data” – a number or a string we define or something similar.

As we usually use just a few hundred words to communicate on a daily basis – there is no doubt that we can achieve some compression here! The “ratio”...? Well, it depends on the text, obviously.
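Just to get a rough feel for the potential gain, here is a tiny back-of-the-envelope check (the word below is just an example I picked, not from the actual list):

-- a common 9-letter word replaced by a 2-character code saves 7 characters per occurrence
SELECT LENGTH('something') - 2 AS chars_saved_per_occurrence;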

OK – I will spare you the details about the time I spent searching for all kinds of solutions to do this in plain SQL…! :)

The big-big problem here is that in plain SQL you can’t ITERATE! There is no FOR, there is no WHILE! You can’t transform characters “one by one”…

After analyzing and analyzing again and so on, I finally found a way to do it – EUREKA! :)

Here is my solution:

  1. Store the words I intend to replace in a dictionary (word replacement) table.
  2. Create a hard-coded “compressed” value for each of the words.
  3. Split the text I will compress into words and store the words in a (temporary) table, using the REPLACE function.
  4. Perform an UPDATE on the temporary table (which contains our text words), replacing each “known” word with its shorter version.
  5. Use the GROUP_CONCAT function to reassemble the new/compressed text.
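
To make steps 4 and 5 a bit more concrete, here is my own rough sketch of how they could look in SQL – the table and column names (text_words, word, word_position, input_word, output_word) are just assumptions for illustration, not necessarily how the final solution will do it:

-- step 4: swap every word that exists in the dictionary for its short code
UPDATE text_words
   SET word = (SELECT output_word FROM compressor WHERE compressor.input_word = text_words.word)
 WHERE word IN (SELECT input_word FROM compressor);

-- step 5: glue the words back together, in their original order
SELECT GROUP_CONCAT(word ORDER BY word_position SEPARATOR ' ') AS compressed_text
  FROM text_words;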

And here are some details and screenshots:

1. I created a CUBRID table which holds the most commonly used English words. I used the lists available here.
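
Something along these lines (a minimal sketch – the column name and size are my assumptions; the screenshot shows the actual layout):

CREATE TABLE compressor (input_word VARCHAR(64));

-- one word per row, loaded from one of those lists
INSERT INTO compressor (input_word) VALUES ('the'), ('of'), ('and'), ('to'), ('in');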



2. I decided to use a compression algorithm based on the row number of each stored “common” word.
As there are no more than a few thousand such words, I can simply replace each word with a 2-byte value, right? :D
Using this compression rule, a 2-byte value can distinguish up to 256 x 256 = 65,536 words! This is more than enough to cover any usual English text!
So I added 2 more columns to the “compressor” table: the row number and the compressed value for each word, using the following SQL to calculate the “compressed” value:

UPDATE compressor
   SET output_word = CONCAT(
           CHR(32 + CAST(SUBSTR(TO_CHAR(id, '0000'), 1, 2) AS INT)),   -- first half of the padded row number -> 1st character
           CHR(32 + CAST(SUBSTR(TO_CHAR(id, '0000'), 3, 2) AS INT)));  -- second half of the padded row number -> 2nd character
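
To see what this formula actually produces, here is a quick check for a single row – assuming TO_CHAR(id, '0000') returns the four zero-padded digits (for example '0137' for id 137):

-- '0137' splits into '01' and '37', which map to CHR(32 + 1) = '!' and CHR(32 + 37) = 'E',
-- so the word stored at row 137 gets replaced by the 2-character code '!E'
SELECT CONCAT(CHR(32 + CAST(SUBSTR('0137', 1, 2) AS INT)),
              CHR(32 + CAST(SUBSTR('0137', 3, 2) AS INT))) AS sample_code;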




...OK! ...that's enough for now... :)
The 2nd part of this post will be published next month - so you have enough time to come up with your own solution or to finish the one I started to show here...


So see you with the next part soon... Bye-bye!
/Dee