Best short program

Seonghoon Kang
kang.seonghoon@mearie.org

Judges' comments:

To build:

make kang

To run:

echo "full spelling of an English cardinal numeral less than a quadrillion" | ./kang

Try:

echo Nineteen hundred and eighty-four | ./kang
echo uno | ./kang
echo trois | ./kang
echo fier | ./kang
echo "shest'" | ./kang

Selected Judges Remarks:

The judges were able to appreciate the Indo-European language family relation by making this entry successfully recognize some French, German, Italian, Russian, and Spanish numerals.

Also worth mentioning is this entry’s ability to understand the colloquial year numbers of the last millennium.

We’ve added a linefeed to the print format for convenience.

Author’s comments:

Synopsis

This short program reads a spelt number (e.g. forty-two) and writes a corresponding decimal number (e.g. 42). Too long for one-liners, alas, but it still qualifies as a short program as it has less than 0x100 bytes.

It accepts a variety of spelt numbers:

It correctly handles zero.
Hyphen does not make a difference: forty-two and forty two are same. So does period or comma.
Cases do not make a difference either: TWO, Two, two are same.
one and a are interchangeable: one hundred and a hundred are same.
and is optional: one hundred twenty-three and one hundred and twenty-three are same.
It supports every non-negative integer less than 10¹⁵-1. It uses the small scale (i.e. American): billion is 10⁹ and trillion is 10¹².
Sometimes, it can magically correct typos.

It does not accept some spelt numbers, which I found mostly irrelevant:

A bare hundred, thousand etc. do not work.
one million million does not work. Get used to one trillion!

Requirements

This program is quite portable, only requiring the following:

The signature int main(int, int) should be accepted by the linker. (Original version only)
char should be at least 8 bits long (as dictated by the standard), int should be at least 32 bits long, long long should be at least 64 bits long.
Both the compiler and execution environment should use an ASCII-compatible character set and two’s complement representation.
Overflow and underflow on char should wrap around, if your char is unsigned.
A trustworthy compiler.

The design of the program explicitly allows for EOF which does not equal to -1 (it has to be negative per the standard) and both signed and unsigned char, for example. Furthermore it is endian-independent.

Obfuscations (SPOILERS!)

Many obfuscations used are typical for standard IOCCC entries:

Two arguments from main function are reused as normal variables.
Every conditional has been replaced with ?: ternary operator and || short-circuiting operator.
It has exactly three nested for loops and nothing else.
Common two’s complement tricks: ~-a instead of a-1, ~a?...:... instead of a!=-1?...:..., etc.
Comma operators (,) for multiple statements. The number of them is minimized, however, as it is too easy to (ab)use them.
It lacks most parentheses around bitwise and arithmetic operators. It was originally written for shortness so parentheses were EVIL.
n["string"] instead of "string"[n]. Both are fine for this program but I went to the former just for fun.
Utter lack of any kind of layouts. (Oh, except for the first column.)

Other obfuscations are more subtle:

The string "1+DIY/.K430x9G(kC[" is 18 bytes long, but actually 19 bytes including the final null character are used.
It internally represents numbers as hexadecimal. When the input is two hundred and three, it actually writes 0x203 as hexadecimal.
Some variables (notably, n) have dual uses.
The magic number 42 makes an appearance.
It has a long long numb-main-er within it!

But the most important obfuscation is the clever construction of lookup table. The program uses 11 different characters required for recognizing 22 lexemes:

zero        one         tw-         th(i)r-     fo(u)r-     fi-         six-
seven-      eigh-       nin-        ten         eleven      twelve
hundred(s)  thousand(s) million(s)  billion(s)  trillion(s)
a           and         -teen       -ty

So that they are internally represented as like:

r        n        tw-      tr-      fr-      f-       s-
sn-      g-       nn-      tn       ln       twl
nr(s)    tsan(s)  lln(s)   blln(s)  trlln(s)
a        an       -tn      -ty

The stemmer recognizes the longest matching prefix, so every lexeme can be recognized by at most three characters (e.g. trl instead of trlln). This is also handy for ignoring plurals. But that would make that the table does not fit in the printable byte—11² is already almost 2⁷!

The trick is to use octal; three characters (a, b and g) are interpreted as sequences of two characters (ny, nn and nw respectively). Asides from a smaller lookup table, it has many good consequences:

Both a and and share the common prefix, ny, and can be discarded altogether. Note that ny itself does not appear from other entries.
thousand is interpreted as tsan, which is equivalent to tfyn in the octal scheme. As it is the only entry with tf prefix, it can be shorten by one character.

Having said this important trick, other details should be relatively easier to follow. The order of lookup table, for example, is very important, and the biggest constant 6177 is not arbitrarily chosen.

Acknowledgement

The cleaner (size-optimized) version of this program was originally published in my website in July 2011. Sun Park and others have reviewed it and let me aware of possible improvements. I’d also like to thank Seo Sanghyeon for proof-reading remarks.