Forum: The Computer Express

Re: Unicode

From Kaz Kylheku@1:261/38 to Ben Bacarisse on Wed Apr 8 19:00:56 2020

From: Kaz Kylheku <773-297-7223@kylheku.com>

On 2020-04-08, Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

Kaz Kylheku <773-297-7223@kylheku.com> writes:

On 2020-04-08, stbalbach2@gmail.com <stbalbach2@gmail.com> wrote:

Given a unicode string:

/usr/bin/printf "\u041c\u043e\u0442\u0438\u043d"

Result: "Мотин"

Is there a native gawk way other than invoking /usr/bin/printf ?

Awk doesn't have \u escapes.

The following works for me with GNU Awk and Mawk on
Ubuntu 18:

$ awk 'BEGIN { print "Мотин" }'
Мотин

That is to say, the implementations appear to be 8 bit clean in the
handling of string literals, so you can write source code in UTF-8,
embedding the extended characters directly.
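If the script itself has to stay plain ASCII, one option that should work in any POSIX awk is to spell out the UTF-8 bytes with octal escapes. A minimal sketch, assuming the terminal expects UTF-8; the variable name `out` is only for illustration, and the octal values are the UTF-8 encoding of U+041C U+043E U+0442 U+0438 U+043D:

```shell
# Each Cyrillic letter here is two UTF-8 bytes, written as \ooo octal escapes:
#   U+041C = D0 9C = \320\234    U+043E = D0 BE = \320\276
#   U+0442 = D1 82 = \321\202    U+0438 = D0 B8 = \320\270
#   U+043D = D0 BD = \320\275
out=$(awk 'BEGIN { print "\320\234\320\276\321\202\320\270\320\275" }')
printf '%s\n' "$out"
```

This relies only on awk passing the bytes through unchanged, which is the same 8-bit-clean behaviour described above.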

The gawk I use (4.2.1) appears to be UTF-8 aware, not just 8-bit clean:

$ awk '/caf[ée]/ {print length($1)}'
cafe
4
café
4

To yank your leg a bit, that could plausibly just be 8 bit ISO-Latin. ;)

But if we try it on the original example, it does report 5 for length("Мотин").

OTOH, mawk reports 10.
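The 5 versus 10 difference is characters versus bytes: each of the five letters takes two bytes in UTF-8. A rough way to see both numbers from the shell, assuming a UTF-8 terminal (the variable names `s`, `bytes` and `c_len` are made up for this sketch):

```shell
s='Мотин'

# Byte count of the UTF-8 encoding, independent of any awk: 5 letters x 2 bytes.
bytes=$(printf '%s' "$s" | wc -c | tr -d ' ')
echo "bytes: $bytes"

# Forcing the C locale should make a locale-aware awk such as gawk
# count bytes too, so it is expected to agree with mawk here.
c_len=$(LC_ALL=C awk -v s="$s" 'BEGIN { print length(s) }')
echo "length in C locale: $c_len"
```

So gawk's answer of 5 depends on it running in a UTF-8 locale; it is not a property of the string literal itself.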

I also tried with an older 4.1.60 that I built from sources a few years ago; that already had it working.

Indexing works too:

0:sun-go:~/gawk$ ./gawk 'BEGIN { s = "Мотин"; print substr(s, 2) }'
отин
0:sun-go:~/gawk$ ./gawk 'BEGIN { s = "Мотин"; print substr(s, 2, 2) }'
от

--- BBBS/Li6 v4.10 Toy-4
* Origin: Prism bbs (1:261/38)
