From: Kaz Kylheku <773-297-7223@kylheku.com>
On 2020-04-08, Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
Kaz Kylheku <773-297-7223@kylheku.com> writes:
On 2020-04-08, stbalbach2@gmail.com <stbalbach2@gmail.com> wrote:
Given a unicode string:
/usr/bin/printf "\u041c\u043e\u0442\u0438\u043d"
Result: "Мотин"
Is there a native gawk way other than invoking /usr/bin/printf ?
Awk doesn't have \u escapes.
The following works for me with GNU Awk and Mawk on
Ubuntu 18:
$ awk 'BEGIN { print "Мотин" }'
"Мотин"
That is to say, the implementations appear to be 8 bit clean in the
handling of string literals, so you can write source code in UTF-8,
embedding the extended characters directly.
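If pasting the characters directly is not an option, one native alternative is to spell out the UTF-8 bytes with octal escapes, which POSIX awk string literals do support. A minimal sketch, assuming the output is meant to be UTF-8; the byte values were worked out by hand from the code points in the original question (U+041C U+043E U+0442 U+0438 U+043D), and the variable name out is only for illustration:

```shell
# Each \NNN is one UTF-8 byte written as an awk octal escape.
# Hex bytes: D0 9C  D0 BE  D1 82  D0 B8  D0 BD
out=$(awk 'BEGIN { printf "%s", "\320\234\320\276\321\202\320\270\320\275" }')

# Shows the Cyrillic word on a UTF-8 terminal.
printf '%s\n' "$out"
```

This only moves bytes around, so it does not depend on the awk being UTF-8 aware.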
The gawk I use (4.2.1) appears to be UTF-8 aware, not just 8-bit clean:
$ awk '/caf[ée]/ {print length($1)}'
cafe
4
café
4
To yank your leg a bit, that could plausibly just be 8 bit ISO-Latin. ;)
But if we try it on the original example, it does report 5 for length("Мотин").
OTOH, mawk reports 10.
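The 5 versus 10 difference can be reproduced on purpose by controlling the locale. A sketch, assuming a POSIX shell; under LC_ALL=C an awk that follows the locale counts bytes, while the second count depends on which awk is installed and on the locale of the session, so no output is promised for it:

```shell
# Build the five-letter word from its UTF-8 bytes so the script itself stays ASCII.
s=$(printf '\320\234\320\276\321\202\320\270\320\275')

# In the C locale every byte is one character.
bytes=$(printf '%s\n' "$s" | LC_ALL=C awk '{ print length($0) }')

# In the session locale the result depends on the awk implementation.
chars=$(printf '%s\n' "$s" | awk '{ print length($0) }')

printf 'LC_ALL=C: %s\n' "$bytes"
printf 'session locale: %s\n' "$chars"
```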
I also tried with an older 4.1.60 that I built from sources a few years ago; that already had it working.
Indexing works too:
0:sun-go:~/gawk$ ./gawk 'BEGIN { s = "Мотин"; print substr(s, 2) }'
отин
0:sun-go:~/gawk$ ./gawk 'BEGIN { s = "Мотин"; print substr(s, 2, 2) }'
от