• Re: Unicode

    From Kaz Kylheku@1:261/38 to Ben Bacarisse on Wed Apr 8 19:00:56 2020
    From: Kaz Kylheku <773-297-7223@kylheku.com>

    On 2020-04-08, Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
    Kaz Kylheku <773-297-7223@kylheku.com> writes:

    On 2020-04-08, stbalbach2@gmail.com <stbalbach2@gmail.com> wrote:
    Given a unicode string:

    /usr/bin/printf "\u041c\u043e\u0442\u0438\u043d"

    Result: "Мотин"

    Is there a native gawk way other than invoking /usr/bin/printf ?

    Awk doesn't have \u escapes.

    The following works for me with GNU Awk and Mawk on
    Ubuntu 18:

    $ awk 'BEGIN { print "Мотин" }'
    "Мотин"

    That is to say, the implementations appear to be 8 bit clean in the
    handling of string literals, so you can write source code in UTF-8,
    embedding the extended characters directly.
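For cases where the code points are only known at run time, the missing \u escapes can be emulated by hand. The following is a minimal sketch, not something from the original posts: the function names utf8_from_codepoints and utf8 are made up for illustration, code points are passed as decimal numbers (hex parsing is not portable across awks), and LC_ALL=C is forced on the assumption that "%c" then emits a single raw byte in gawk as well as mawk.

```shell
# Hypothetical helper: print the UTF-8 encoding of decimal code points.
# LC_ALL=C is an assumption made so that "%c" emits one raw byte per value.
utf8_from_codepoints() {
    LC_ALL=C awk '
    # Encode one code point as 1 to 4 UTF-8 bytes.
    function utf8(cp) {
        if (cp < 128)
            return sprintf("%c", cp)
        if (cp < 2048)
            return sprintf("%c%c", 192 + int(cp / 64), 128 + cp % 64)
        if (cp < 65536)
            return sprintf("%c%c%c", 224 + int(cp / 4096),
                           128 + int(cp / 64) % 64, 128 + cp % 64)
        return sprintf("%c%c%c%c", 240 + int(cp / 262144),
                       128 + int(cp / 4096) % 64,
                       128 + int(cp / 64) % 64, 128 + cp % 64)
    }
    BEGIN {
        # Only a BEGIN block is present, so the arguments are never opened as files.
        for (i = 1; i < ARGC; i++)
            out = out utf8(ARGV[i] + 0)
        print out
    }' "$@"
}

# U+041C U+043E U+0442 U+0438 U+043D, written in decimal
utf8_from_codepoints 1052 1086 1090 1080 1085
```

The sketch does no validation (surrogates, values above U+10FFFF), so it is only suitable for trusted input.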

    The gawk I use (4.2.1) appears to be UTF-8 aware, not just 8-bit clean:

    $ awk '/caf[ée]/ {print length($1)}'
    cafe
    4
    café
    4
    4

    To yank your leg a bit, that could plausibly just be 8 bit ISO-Latin. ;)

    But if we try it on the original example, it does report 5 for length("Мотин").

    OTOH, mawk reports 10.

    I also tried with an older 4.1.60 that I built from sources a few years ago; that already had it working.
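The 5 versus 10 difference is characters versus bytes: each of the five Cyrillic letters takes two bytes in UTF-8. As a rough sketch (the name utf8_length is invented here, and LC_ALL=C is assumed to make every awk count bytes), a byte-oriented awk can still arrive at the character count by skipping UTF-8 continuation bytes, i.e. those in the range 0x80 to 0xBF:

```shell
# Hypothetical helper: print "<bytes> <characters>" for a UTF-8 string.
# Assumes well-formed UTF-8 input; LC_ALL=C forces byte semantics.
utf8_length() {
    LC_ALL=C awk '
    BEGIN {
        # Build a lookup table of the 64 continuation bytes (128..191).
        for (i = 128; i <= 191; i++)
            cont[sprintf("%c", i)] = 1
        s = ARGV[1]
        n = 0
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (!(c in cont))   # lead byte or ASCII: starts a new character
                n++
        }
        print length(s), n
    }' "$1"
}

utf8_length "Мотин"
```

The first number corresponds to what a byte-oriented awk reports for length(), the second to what a UTF-8-aware one reports.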

    Indexing works too:

    0:sun-go:~/gawk$ ./gawk 'BEGIN { s = "Мотин"; print substr(s, 2) }'
    отин
    0:sun-go:~/gawk$ ./gawk 'BEGIN { s = "Мотин"; print substr(s, 2, 2) }'
    от

    --- BBBS/Li6 v4.10 Toy-4
    * Origin: Prism bbs (1:261/38)