April 26, 2013

Let’s do some PLAIN SQL text compression…! (Part II)

OK – It’s time to complete what we started last month...! ...and, of course, it's nice to see you back... :)

First, let's remember the challenge: Let’s do some plain SQL text compression (in CUBRID, of course..)

So you have a table which contains plain, “normal” English text, let’s say the daily employees status report and:
  • You don’t have an issue with the database response speed, but you have an issue with the big data(base) size
  • The solution you came up with is obviously to compress its content, using some text compression algorithm
  • You can’t use stored procedures! (as they are not standard plain SQL)
  • You must achieve compression only by using SQL statements and tables and what else a “standard” DMBS is offering!
  • You can use as many SQL statements as you want!
  • Once gain – you can use only standard SQL statements (including any well-known functions and data manipulation commands)

Just in case you forgot were we arrived in the first part, please make sure you remember...

The last thing we did was to create a “compressor" table, with a row number, an input word and the "compressed" value for the input word, using the following SQL to calculate the “compressed” value:

UPDATE compressor SET output_word = CONCAT(CHR(32 + CAST(SUBSTR(TO_CHAR(id, '0000'), 1,2) AS INT)), CHR(32 + CAST(SUBSTR(TO_CHAR(id, '0000'), 3,2) AS INT)))

Let's continue from there!

3. Next, I created a temporary table – “words” - to hold text-to-be-compressed words and I created the SQL statement to populate the table:

4. Then, I replaced each “known” word from the "words" table with its compressed equivalent:

5. Finally, the compressed text is created from the “words” table, using the GROUP_CONCAT function:

That's all, folks!... …I will leave the “revert”(= "uncompress") algorithm steps to you! :)

Of course, my final question (and challenge for you) is: What is your solution? Have you found a better one…? :)

See you next time – Bye-bye!

P.S. I almost forgot to tell you about a much simpler, a no-brainer solution! A solution which works, but it’s quite ugly…
Here it is: You simply execute a chained REPLACE command (using multiple SQLs or just a big-big-big one), using as many times as you want the most common used English words you choose to hard-code.

For example, you could do:

SELECT 'John Doe is the man. He is better, taller and stronger!' INTO :input_txt;
SELECT :input_txt INTO :output_txt;
SELECT REPLACE(:output_txt, '. ', '^0') INTO :output_txt;
SELECT REPLACE(:output_txt, ' of ', '^1') INTO :output_txt;
SELECT REPLACE(:output_txt, ' to ', '^2') INTO :output_txt;
SELECT REPLACE(:output_txt, ' is ', '^3') INTO :output_txt;
… and so on ...

This would obviously ensure (some) compression, and the more words you choose, the better compression you have...
...but yes, it’s far-far-far from looking good or optimal…...but in the end, if you refer strictly to the problem of achieving some level of a text compression, it is probably the simplest solution after all...

No comments:

Post a Comment