Forum: The Computer Express

Weird code crash

From The Natural Philosopher@3:770/3 to All on Thu Sep 14 06:23:14 2023

XPost: comp.os.linux.misc

I don't expect people to know the answer, but I could use some help in
puzzling out where to look.

I had a power cut that did leave my network a bit sketchy and it took
two reboots on this desktop to get back to normal. This may or may not
be relevant.

But my question refers to my Pi Zero W server I am developing.

It came up, ok, but then after a while my relay daemon crashed...

Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.

I rebooted it, and after awhile - about ten minutes, it happened again -
that is the above trace.

I restarted it manually, and it hasn't crashed since.

The web is flooded with instances of this message all on different
platforms and applications, and it would appear this is a very generic
message possibly to do with memory issues.

One person 'fixed' it by changing CPUs...
Now *as far as I know* there was nothing special about the data the
daemon would be operating on at this point to cause it to crash. I am
fairly sure I have no memory leaks in it - in normal operation it
strdups() and frees() and opens and closes files... and 'top' shows
memory usage is rock steady.

One possibility is that it is opening and reading a file at the precise
time another process is writing it...in both cases the read and write operations are atomic and done with C code.

READ
====
fp=fopen(fullname, "r");
len=fread(filbuf,1,255,fp); // read entire file

WRITE
=====
fp=fopen(filename, "w");
if (fp)
{
fprintf(fp,"%s%s\n",filedata,timestamp);
fclose(fp);
}

Could this cause a problem?

I tend to suspect some sort of asynchronous timing issue because it is
such a rare occurrence. I have been utterly unable to make it happen on demand...

--
A lie can travel halfway around the world while the truth is putting on
its shoes.

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Ahem A Rivet's Shot@3:770/3 to The Natural Philosopher on Thu Sep 14 07:09:14 2023

XPost: comp.os.linux.misc

On Thu, 14 Sep 2023 06:23:15 +0100
The Natural Philosopher <tnp@invalid.invalid> wrote:

One possibility is that it is opening and reading a file at the precise
time another process is writing it...in both cases the read and write operations are atomic and done with C code.

READ
====
fp=fopen(fullname, "r");

Anything opened with fopen is a buffered stream; operations on it
are not atomic, so yes, it is very possible for the read to see a
partially written file. To avoid the race you need to use some kind of
locking.
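
One way "some kind of locking" could look on Linux is advisory locking with flock(). This is only a sketch under assumptions: the helper names locked_write and locked_read are hypothetical, both processes must cooperate by taking the lock, and the writer must not open with "w" (that would truncate before the lock is held), so it opens without truncation and truncates only once it owns the lock.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of path while holding an
 * exclusive advisory lock. Returns 0 on success, -1 on error. */
int locked_write(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);
    int fd = open(path, O_WRONLY | O_CREAT, 0644);  /* deliberately no O_TRUNC */
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {                  /* blocks while readers hold LOCK_SH */
        close(fd);
        return -1;
    }
    size_t len = strlen(text);
    int ok = ftruncate(fd, 0) == 0 && write(fd, text, len) == (ssize_t)len;
    close(fd);                                      /* closing the descriptor drops the lock */
    return ok ? 0 : -1;
}

/* Hypothetical helper: read up to cap-1 bytes under a shared lock and
 * NUL-terminate. Returns the byte count, or -1 on error. */
ssize_t locked_read(const char *path, char *buf, size_t cap)
{
    assert(path != NULL && buf != NULL && cap > 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_SH) != 0) {                  /* blocks while a writer holds LOCK_EX */
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, buf, cap - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

The lock is advisory: a process that simply calls fopen() without taking it is not stopped, so both the daemon and the writer would need to use these paths.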

--
Steve O'Hara-Smith
Odds and Ends at http://www.sohara.org/
Host: Beautiful Theory meet Inconvenient Fact
Obit: Beautiful Theory died today of factual inconsistency

From The Natural Philosopher@3:770/3 to Ahem A Rivet's Shot on Thu Sep 14 07:57:44 2023

XPost: comp.os.linux.misc

On 14/09/2023 07:09, Ahem A Rivet's Shot wrote:

On Thu, 14 Sep 2023 06:23:15 +0100
The Natural Philosopher <tnp@invalid.invalid> wrote:

One possibility is that it is opening and reading a file at the precise
time another process is writing it...in both cases the read and write
operations are atomic and done with C code.

READ
====
fp=fopen(fullname, "r");

Anything opened with fopen is a buffered stream; operations on it
are not atomic, so yes, it is very possible for the read to see a
partially written file. To avoid the race you need to use some kind of
locking.

Hmm.

However, I think that for small operations one would have to posit a time
between fopen() and fread() in which the file 'disappears' in some
sense. But I *thought* that a file handle once issued would not point
to empty data, and that fopen("w") would in fact create a new
file and the old one would not get unlinked until it was 'fclosed'.
--
"Corbyn talks about equality, justice, opportunity, health care, peace, community, compassion, investment, security, housing...."
"What kind of person is not interested in those things?"

"Jeremy Corbyn?"

From Richard Kettlewell@3:770/3 to The Natural Philosopher on Thu Sep 14 08:45:58 2023

XPost: comp.os.linux.misc

The Natural Philosopher <tnp@invalid.invalid> writes:

READ
====
fp=fopen(fullname, "r");
len=fread(filbuf,1,255,fp); // read entire file

There’s no error checking on the call to fopen, so fp could be a null
pointer when you call fread. So crashes are to be expected, although in
this code fragment a SIGSEGV would be expected rather than SIGABRT.
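
A minimal sketch of what a checked version of the READ fragment could look like. The helper name read_small_file and the caller-supplied buffer are assumptions for illustration, not the daemon's actual code; it checks fopen(), leaves room for a terminating NUL, and always closes the stream.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read up to cap-1 bytes of a small text file into buf
 * and NUL-terminate it. Returns the number of bytes read (0 for an empty
 * file) or -1 if the file could not be opened or read. */
long read_small_file(const char *path, char *buf, size_t cap)
{
    assert(path != NULL && buf != NULL && cap > 0);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)                 /* fread(…, NULL) would be undefined behaviour */
        return -1;
    size_t len = fread(buf, 1, cap - 1, fp);
    int failed = ferror(fp);
    fclose(fp);                     /* close on every path, so no stream is leaked */
    if (failed)
        return -1;
    buf[len] = '\0';                /* safe: len <= cap - 1 */
    return (long)len;
}
```

A caller can then tell "could not open" (-1) apart from "opened but empty" (0), which matters if another process may have just truncated the file.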

WRITE
=====
fp=fopen(filename, "w");
if (fp)
{
fprintf(fp,"%s%s\n",filedata,timestamp);
fclose(fp);
}

Could this cause a problem?

I tend to suspect some sort of asynchronous timing issue because it is
such a rare occurrence. I have been utterly unable to make it happen
on demand...

Investigate properly first (see Theo’s post), guess about the cause
later.

--
https://www.greenend.org.uk/rjk/

From Theo@3:770/3 to The Natural Philosopher on Thu Sep 14 08:36:06 2023

XPost: comp.os.linux.misc

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.

I rebooted it, and after awhile - about ten minutes, it happened again -
that is the above trace.

I restarted it manually, and it hasn't crashed since.

The web is flooded with instances of this message all on different
platforms and applications, and it would appear this is a very generic message possibly to do with memory issues.

You're getting SIGABRT which is typically something bailing due to memory corruption, eg corrupting metadata so that malloc can't work, or a
double-free.
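
To make the "status=6/ABRT" in the log concrete, here is a small sketch that runs a function in a child process and reports which signal killed it. The names trigger_abort and signal_that_killed are hypothetical, and abort() is called directly as a stand-in for the abort() that glibc issues when it detects heap corruption or a double free.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a detected heap error: the process aborts itself. */
void trigger_abort(void)
{
    abort();
}

/* Run fn in a child process. Returns the number of the signal that killed
 * the child, 0 if it exited normally, or -1 if fork/waitpid failed. */
int signal_that_killed(void (*fn)(void))
{
    assert(fn != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core);   /* keep the demo from leaving a core file */
        fn();
        _exit(0);                           /* only reached if fn returns */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling signal_that_killed(trigger_abort) yields SIGABRT, which is signal 6 on Linux, the same number systemd prints.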

I would compile it with debugging enabled: '-g' or '-ggdb' flag to your compiler. Then run it under gdb:

$ gdb ./myprog
(gdb) run

and see if it dies. If it does you can get a backtrace to indicate where
the fault occurred:

(gdb) bt

It may be that starting it under systemd is different in some way that it doesn't show up when running it by hand. You could try setting as your
systemd command:

gdb -ex run -ex bt --args /usr/local/bin/myprog arg1 arg2

which will run it and then dump a backtrace when it's finished. You may get 'no stack' if it succeeded and didn't record one.

Theo

From Tauno Voipio@3:770/3 to The Natural Philosopher on Thu Sep 14 10:55:34 2023

XPost: comp.os.linux.misc

The first try should be to check if the system runs fine from a
backup memory card (you have it?).

It is fairly possible that the memory card has some flipped bits,
and the effects are hard to predict.

--

-TV

From Ahem A Rivet's Shot@3:770/3 to The Natural Philosopher on Thu Sep 14 08:52:36 2023

XPost: comp.os.linux.misc

On Thu, 14 Sep 2023 07:57:45 +0100
The Natural Philosopher <tnp@invalid.invalid> wrote:

However, I think that for small operations one would have to posit a time
between fopen() and fread() in which the file 'disappears' in some
sense. But I *thought* that a file handle once issued would not point
to empty data, and that fopen("w") would in fact create a new
file and the old one would not get unlinked until it was 'fclosed'.

Nope - from man fopen

“w” Open for writing. The stream is positioned at the beginning of
the file. Truncate the file to zero length if it exists or
create the file if it does not exist.
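
The truncation described in that man page excerpt can be observed directly. A small sketch, with the hypothetical helper name size_while_open_for_write: it opens an existing file with "w" and checks the on-disk size before anything has been written.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: open path with "w" and report the file's size while
 * the stream is open but nothing has been written yet. Returns the size in
 * bytes, or -1 on error. The previous contents of path are lost. */
long size_while_open_for_write(const char *path)
{
    assert(path != NULL);
    FILE *fp = fopen(path, "w");    /* the file is truncated here, not at fclose() */
    if (fp == NULL)
        return -1;
    struct stat st;
    long size = (stat(path, &st) == 0) ? (long)st.st_size : -1;
    fclose(fp);
    return size;
}
```

For a file that previously held data this returns 0, which is the window in which a concurrent reader would see an empty file.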

--
Steve O'Hara-Smith
Odds and Ends at http://www.sohara.org/
Host: Beautiful Theory meet Inconvenient Fact
Obit: Beautiful Theory died today of factual inconsistency

From Richard Kettlewell@3:770/3 to Theo on Thu Sep 14 09:23:00 2023

XPost: comp.os.linux.misc

Theo <theom+news@chiark.greenend.org.uk> writes:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.

I rebooted it, and after awhile - about ten minutes, it happened again -
that is the above trace.

I restarted it manually, and it hasn't crashed since.

The web is flooded with instances of this message all on different
platforms and applications, and it would appear this is a very generic
message possibly to do with memory issues.

You're getting SIGABRT which is typically something bailing due to memory corruption, eg corrupting metadata so that malloc can't work, or a double-free.

I would compile it with debugging enabled: '-g' or '-ggdb' flag to your compiler. Then run it under gdb:

$ gdb ./myprog
(gdb) run

and see if it dies. If it does you can get a backtrace to indicate where
the fault occurred:

(gdb) bt

It may be that starting it under systemd is different in some way that it doesn't show up when running it by hand. You could try setting as your systemd command:

gdb -ex run -ex bt --args /usr/local/bin/myprog arg1 arg2

which will run it and then dump a backtrace when it's finished. You may get 'no stack' if it succeeded and didn't record one.

Also:

* I would also have a look at the kernel log; if it’s a kernel-generated
signal then there’s usually a log message about it.

* Run the application under valgrind; depending what the issue is, that
will provide a backtrace and perhaps more detailed information. If it
is a memory corruption issue then it may identify where the corruption
happens, rather than the later point where malloc failed a consistency
check (or whatever it is).

Using valgrind (and/or compiler sanitizer features) is a good idea even
before running into trouble, really.

--
https://www.greenend.org.uk/rjk/

From The Natural Philosopher@3:770/3 to Ahem A Rivet's Shot on Thu Sep 14 12:27:52 2023

XPost: comp.os.linux.misc

On 14/09/2023 08:52, Ahem A Rivet's Shot wrote:

On Thu, 14 Sep 2023 07:57:45 +0100
The Natural Philosopher <tnp@invalid.invalid> wrote:

However, I think that for small operations one would have to posit a time
between fopen() and fread() in which the file 'disappears' in some
sense. But I *thought* that a file handle once issued would not point
to empty data, and that fopen("w") would in fact create a new
file and the old one would not get unlinked until it was 'fclosed'.

Nope - from man fopen

“w” Open for writing. The stream is positioned at the beginning of
the file. Truncate the file to zero length if it exists or
create the file if it does not exist.

OK, so there is a finite chance that an empty (zero-length) file might
be read.
That is worth checking.
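
One common way to close that window without locking is to write a temporary file and rename() it over the real one, since rename() replaces the target atomically when both names are on the same filesystem. This is a sketch under assumptions: the helper name write_atomically and the ".tmp" suffix are hypothetical, and a reader that scans the directory would have to skip the temporary name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write text to "<path>.tmp", then rename it over path.
 * A reader opening path sees either the complete old contents or the
 * complete new contents, never an empty or half-written file.
 * Returns 0 on success, -1 on error. */
int write_atomically(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                          /* path too long for the buffer */

    FILE *fp = fopen(tmp, "w");
    if (fp == NULL)
        return -1;
    size_t len = strlen(text);
    int ok = fwrite(text, 1, len, fp) == len;
    if (fclose(fp) != 0)                    /* fclose flushes; a failure means lost data */
        ok = 0;
    if (ok && rename(tmp, path) == 0)
        return 0;
    remove(tmp);                            /* do not leave a stale temporary behind */
    return -1;
}
```

The design choice here is that the writer never opens the live file at all, so the truncate-on-open behaviour of "w" only ever affects the temporary.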

--
"A point of view can be a dangerous luxury when substituted for insight
and understanding".

Marshall McLuhan

From The Natural Philosopher@3:770/3 to Richard Kettlewell on Thu Sep 14 12:54:38 2023

XPost: comp.os.linux.misc

On 14/09/2023 09:23, Richard Kettlewell wrote:

Theo <theom+news@chiark.greenend.org.uk> writes:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.

I rebooted it, and after awhile - about ten minutes, it happened again -
that is the above trace.

I restarted it manually, and it hasn't crashed since.

The web is flooded with instances of this message all on different
platforms and applications, and it would appear this is a very generic
message possibly to do with memory issues.

You're getting SIGABRT which is typically something bailing due to memory
corruption, eg corrupting metadata so that malloc can't work, or a
double-free.

I would compile it with debugging enabled: '-g' or '-ggdb' flag to your
compiler. Then run it under gdb:

$ gdb ./myprog
(gdb) run

and see if it dies. If it does you can get a backtrace to indicate where
the fault occurred:

(gdb) bt

It may be that starting it under systemd is different in some way that it
doesn't show up when running it by hand. You could try setting as your
systemd command:

gdb -ex run -ex bt --args /usr/local/bin/myprog arg1 arg2

which will run it and then dump a backtrace when it's finished. You may get
'no stack' if it succeeded and didn't record one.

Also:

* I would also have a look at the kernel log; if it’s a kernel-generated
signal then there’s usually a log message about it.

Nothing in kern.log after the boot process finishes.

* Run the application under valgrind; depending what the issue is, that
will provide a backtrace and perhaps more detailed information. If it
is a memory corruption issue then it may identify where the corruption
happens, rather than the later point where malloc failed a consistency
check (or whatever it is).

Using valgrind (and/or compiler sanitizer features) is a good idea even before running into trouble, really.

The strange thing is that it failed once after a minute, then I rebooted
and it failed after 20 minutes, and it's been running several days now
with no issues at all.

I am not sure valgrind would actually help unless it failed.
--
No Apple devices were knowingly used in the preparation of this post.

From candycanearter07@3:770/3 to Theo on Thu Sep 14 07:47:30 2023

XPost: comp.os.linux.misc

On 9/14/23 02:36, Theo wrote:

You're getting SIGABRT which is typically something bailing due to memory corruption, eg corrupting metadata so that malloc can't work, or a double-free.

I would compile it with debugging enabled: '-g' or '-ggdb' flag to your compiler. Then run it under gdb:

$ gdb ./myprog
(gdb) run

and see if it dies. If it does you can get a backtrace to indicate where
the fault occurred:

(gdb) bt

If you have coredumps enabled, you could also do coredumpctl debug to
enter a gdb session of the last coredump that happened.

--
--
user <candycane> is generated from /dev/urandom

From Theo@3:770/3 to The Natural Philosopher on Thu Sep 14 14:59:40 2023

XPost: comp.os.linux.misc

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

The strange thing is that it failed once after a minute, then I rebooted
and it failed after 20 minutes, and it's been running several days now
with no issues at all.

I am not sure valgrind would actually help unless it failed.

valgrind will tell you if it spots memory corruption, even if the corruption
is not yet enough to cause it to crash. It may help in making the problem clearer and deterministic where the corruption makes it unpredictable.

Theo

From The Natural Philosopher@3:770/3 to Theo on Thu Sep 14 16:25:14 2023

XPost: comp.os.linux.misc

On 14/09/2023 14:59, Theo wrote:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

The strange thing is that it failed once after a minute, then I rebooted
and it failed after 20 minutes, and it's been running several days now
with no issues at all.

I am not sure valgrind would actually help unless it failed.

valgrind will tell you if it spots memory corruption, even if the corruption is not yet enough to cause it to crash. It may help in making the problem clearer and deterministic where the corruption makes it unpredictable.

Theo

I am wondering if the real reason is, that I trod on it. It is so
utterly random that I am thinking that there may be a hardware issue
like a cracked board. I wrecked the USB power socket for sure.

Well a new untrodden on Pi is not the bank breaker that it might be....

Thanks for all the helpful comments, but I am not ready to delve into
reams of stack traces just yet.

I think watch and see and then maybe try another board.

--
When plunder becomes a way of life for a group of men in a society, over
the course of time they create for themselves a legal system that
authorizes it and a moral code that glorifies it.

Frédéric Bastiat

From The Natural Philosopher@3:770/3 to Ralf Fassel on Thu Sep 14 16:35:30 2023

XPost: comp.os.linux.misc

On 14/09/2023 16:29, Ralf Fassel wrote:

* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.

| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file

Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.

R'

Ralf, I already put that in this morning, recompiled the code and after
an hour, it crashed again.

The filename is built by scanning a directory so the filename must exist.

The code runs as root, so there are no perms issues

I've put in checks to avoid trying to read empty files

I am leaning towards possibly a cracked solder joint or board.

--
The New Left are the people they warned you about.

From nev young@3:770/3 to The Natural Philosopher on Thu Sep 14 17:16:22 2023

XPost: comp.os.linux.misc

On 14/09/2023 06:23, The Natural Philosopher wrote:

I don't expect people to know the answer, but I could use some help in puzzling out where to look.

One possibility is that it is opening and reading a file at the precise
time another process is writing it...in both cases the read and write operations are atomic and done with C code.

READ
====
fp=fopen(fullname, "r");
len=fread(filbuf,1,255,fp); // read entire file

Elsewhere in this thread it was suggested to check fp != NULL.
Not knowing what the actual program is doing might I suggest also
closing fp after it has been read.

WRITE
=====
fp=fopen(filename, "w");
if (fp)
    {
    fprintf(fp,"%s%s\n",filedata,timestamp);
    fclose(fp);
    }

--
Nev
It causes me a great deal of regret and remorse
that so many people are unable to understand what I write.

From Ralf Fassel@3:770/3 to All on Thu Sep 14 17:29:46 2023

XPost: comp.os.linux.misc

* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.

| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file

Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.

R'

From David W. Hodgins@3:770/3 to The Natural Philosopher on Thu Sep 14 13:44:16 2023

XPost: comp.os.linux.misc

On Thu, 14 Sep 2023 11:35:30 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

On 14/09/2023 16:29, Ralf Fassel wrote:

* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.

| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file

Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.

R'

Ralf, I already put that in this morning, recompiled the code and after
an hour, it crashed again.

The filename is built by scanning a directory so the filename must exist.

The code runs as root, so there are no perms issues

I've put in checks to avoid trying to read empty files

I am leaning towards possibly a cracked solder joint or board.

Have you run fsck on the file system since the power loss? Make sure the fstab entry does not have a zero in the sixth field for the file system(s) in use.
If using systemd, run dracut -f after any fstab changes. Then reboot.

Regards, Dave Hodgins

From The Natural Philosopher@3:770/3 to David W. Hodgins on Thu Sep 14 19:42:26 2023

XPost: comp.os.linux.misc

On 14/09/2023 18:44, David W. Hodgins wrote:

On Thu, 14 Sep 2023 11:35:30 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

On 14/09/2023 16:29, Ralf Fassel wrote:

* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.

| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file

Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.

R'

Ralf, I already put that in this morning, recompiled the code and after
an hour, it crashed again.

The filename is built by scanning a directory so the filename must exist.

The code runs as root, so there are no perms issues

I've put in checks to avoid trying to read empty files

I am leaning towards possibly a cracked solder joint or board.

Have you run fsck on the file system since the power loss? Make sure the fstab
entry does not have a zero in the sixth field for the file system(s) in
use.
If using systemd, run dracut -f after any fstab changes. Then reboot.

Regards, Dave Hodgins

I assumed that the thing would have done its own fsck on every boot anyway...isn't that a Debian default?

(The sixth fields are 2 and 1 respectively for the file systems)

PARTUUID=b8c9fbb7-01 /boot vfat defaults 0 2
PARTUUID=b8c9fbb7-02 / ext4 defaults,noatime 0 1

--
Canada is all right really, though not for the whole weekend.

"Saki"

From David W. Hodgins@3:770/3 to The Natural Philosopher on Thu Sep 14 14:53:20 2023

XPost: comp.os.linux.misc

On Thu, 14 Sep 2023 14:42:27 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

On 14/09/2023 18:44, David W. Hodgins wrote:

On Thu, 14 Sep 2023 11:35:30 -0400, The Natural Philosopher
<tnp@invalid.invalid> wrote:

On 14/09/2023 16:29, Ralf Fassel wrote:

* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.

| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file

Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.

R'

Ralf, I already put that in this morning, recompiled the code and after
an hour, it crashed again.

The filename is built by scanning a directory so the filename must exist.

The code runs as root, so there are no perms issues

I've put in checks to avoid trying to read empty files

I am leaning towards possibly a cracked solder joint or board.

Have you run fsck on the file system since the power loss? Make sure the
fstab entry does not have a zero in the sixth field for the file system(s)
in use.
If using systemd, run dracut -f after any fstab changes. Then reboot.

Regards, Dave Hodgins

I assumed that the thing would have done its own fsck on every boot anyway...isn't that a Debian default?

(The sixth fields are 2 and 1 respectively for the file systems)

PARTUUID=b8c9fbb7-01 /boot vfat defaults 0 2
PARTUUID=b8c9fbb7-02 / ext4 defaults,noatime 0 1

Does it use systemd? If so, confirm it was clean with
"journalctl -b --no-h|grep fsck"

Regards, Dave Hodgins

From The Natural Philosopher@3:770/3 to nev young on Thu Sep 14 19:38:02 2023

XPost: comp.os.linux.misc

On 14/09/2023 17:16, nev young wrote:

On 14/09/2023 06:23, The Natural Philosopher wrote:

I don't expect people to know the answer, but I could use some help in
puzzling out where to look.

One possibility is that it is opening and reading a file at the
precise time another process is writing it...in both cases the read
and write operations are atomic and done with C code.

READ
====
fp=fopen(fullname, "r");
len=fread(filbuf,1,255,fp); // read entire file

Elsewhere in this thread it was suggested to check fp != NULL.
Not knowing what the actual program is doing might I suggest also
closing fp after it has been read.

Both already done. Not closing it was the cause of a memory leak, but I
fixed that a fortnight ago.

I am beginning to wonder if I did more damage than just the power socket
when I trod on it.

WRITE
=====
fp=fopen(filename, "w");
if (fp)
     {
     fprintf(fp,"%s%s\n",filedata,timestamp);
     fclose(fp);
     }

--
Canada is all right really, though not for the whole weekend.

"Saki"

From The Natural Philosopher@3:770/3 to David W. Hodgins on Thu Sep 14 19:57:36 2023

XPost: comp.os.linux.misc

On 14/09/2023 19:53, David W. Hodgins wrote:

journalctl -b --no-h|grep fsck

Sep 14 14:17:03 systemd[1]: Created slice system-systemd\x2dfsck.slice.
Sep 14 14:17:03 systemd[1]: Listening on fsck to fsckd communication Socket.
Sep 14 14:17:04 systemd-fsck[109]: e2fsck 1.46.2 (28-Feb-2021)
Sep 14 14:17:04 systemd-fsck[109]: rootfs: clean, 51075/932256 files, 460111/3822976 blocks
Sep 14 14:17:14 systemd-fsck[178]: fsck.fat 4.2 (2021-01-31)
Sep 14 14:17:14 systemd-fsck[178]: There are differences between boot sector and its backup.
Sep 14 14:17:14 systemd-fsck[178]: This is mostly harmless. Differences: (offset:original/backup)
Sep 14 14:17:14 systemd-fsck[178]: 65:01/00
Sep 14 14:17:14 systemd-fsck[178]: Not automatically fixing this.
Sep 14 14:17:14 systemd-fsck[178]: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
Sep 14 14:17:14 systemd-fsck[178]: Automatically removing dirty bit.
Sep 14 14:17:14 systemd-fsck[178]: *** Filesystem was changed ***
Sep 14 14:17:14 systemd-fsck[178]: Writing changes.
Sep 14 14:17:14 systemd-fsck[178]: /dev/mmcblk0p1: 330 files, 25815/130554 clusters
Sep 14 14:30:12 systemd[1]: systemd-fsckd.service: Succeeded.

--
“Those who can make you believe absurdities, can make you commit atrocities.”

― Voltaire, Questions sur les Miracles à M. Claparede, Professeur de Théologie à Genève, par un Proposant: Ou Extrait de Diverses Lettres de
M. de Voltaire

From candycanearter07@3:770/3 to The Natural Philosopher on Thu Sep 14 14:40:56 2023

XPost: comp.os.linux.misc

On 9/14/23 13:42, The Natural Philosopher wrote:

I assumed that the thing would have done its own fsck on every boot anyway...isn't that a Debian default?

Pretty sure it's a standard, my arch install has it set.

(The sixth fields are 2 and 1 respectively for the file systems)

PARTUUID=b8c9fbb7-01 /boot vfat defaults 0 2
PARTUUID=b8c9fbb7-02 / ext4 defaults,noatime 0 1

1 is the fsck pass for the root partition and 2 is for the others, right?

--
--
user <candycane> is generated from /dev/urandom

From David W. Hodgins@3:770/3 to The Natural Philosopher on Thu Sep 14 15:57:08 2023

XPost: comp.os.linux.misc

On Thu, 14 Sep 2023 14:57:36 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

On 14/09/2023 19:53, David W. Hodgins wrote:

journalctl -b --no-h|grep fsck

Sep 14 14:17:03 systemd[1]: Created slice system-systemd\x2dfsck.slice.
Sep 14 14:17:03 systemd[1]: Listening on fsck to fsckd communication Socket. Sep 14 14:17:04 systemd-fsck[109]: e2fsck 1.46.2 (28-Feb-2021)
Sep 14 14:17:04 systemd-fsck[109]: rootfs: clean, 51075/932256 files, 460111/3822976 blocks
Sep 14 14:17:14 systemd-fsck[178]: fsck.fat 4.2 (2021-01-31)
Sep 14 14:17:14 systemd-fsck[178]: There are differences between boot sector and its backup.
Sep 14 14:17:14 systemd-fsck[178]: This is mostly harmless. Differences: (offset:original/backup)
Sep 14 14:17:14 systemd-fsck[178]: 65:01/00
Sep 14 14:17:14 systemd-fsck[178]: Not automatically fixing this.
Sep 14 14:17:14 systemd-fsck[178]: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
Sep 14 14:17:14 systemd-fsck[178]: Automatically removing dirty bit.
Sep 14 14:17:14 systemd-fsck[178]: *** Filesystem was changed ***
Sep 14 14:17:14 systemd-fsck[178]: Writing changes.
Sep 14 14:17:14 systemd-fsck[178]: /dev/mmcblk0p1: 330 files, 25815/130554 clusters
Sep 14 14:30:12 systemd[1]: systemd-fsckd.service: Succeeded.

If there are any corrupted files, diagnosing any problems they cause will be difficult. I strongly recommend re-installing.

Regards, Dave Hodgins

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Theo@3:770/3 to The Natural Philosopher on Thu Sep 14 21:51:28 2023

XPost: comp.os.linux.misc

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

both already done. Not closing it was the cause of a memory leak but I
fixed that a fortnight ago.

I am beginning to wonder if I did more damage than just the power socket
when I trod on it.

SIGABRT is a problem in your code. If you aren't seeing stuff in the kernel log then it almost certainly isn't a hardware fault. It is a very special skill to have a hardware fault without spewing lots of stuff there.

Post the code somewhere and someone can take a look. Otherwise you need to
use the development tools available to you to debug the problem.

Theo

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Robert Riches@3:770/3 to The Natural Philosopher on Fri Sep 15 00:40:18 2023

On 2023-09-14, The Natural Philosopher <tnp@invalid.invalid> wrote:

On 14/09/2023 16:29, Ralf Fassel wrote:

* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.

| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file

Check for fp != NULL is missing here in this example code before
fread(). If this also in the production version, it might be a problem
if the file is not accessible for any reason.

R'

Ralf, I already put that in this morning, re compiled the code and after
an hour, it crashed again.

The filename is built by scanning a directory so the filename must exist.

Maybe not applicable in this situation, but if something deleted
the file between the time of the scan and the time of the fopen
call, it might/would not exist.

--
Robert Riches
spamtrap42@jacob21819.net
(Yes, that is one of my email addresses.)

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Richard Kettlewell@3:770/3 to David W. Hodgins on Fri Sep 15 08:20:54 2023

XPost: comp.os.linux.misc

"David W. Hodgins" <dwhodgins@nomail.afraid.org> writes:

The Natural Philosopher <tnp@invalid.invalid> wrote:

I am leaning towards possibly a cracked solder joint or board.

Again, I agree with Theo. Reported behavior is not really consistent
with a hardware fault.

Have you run fsck on the file system since the power loss? Make sure the fstab
entry does not have a zero in the sixth field for the file system(s) in use. If using systemd, run dracut -f after any fstab changes. Then reboot.

Reported behavior is also not consistent with a corrupt filesystem.

--
https://www.greenend.org.uk/rjk/

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Richard Kettlewell@3:770/3 to The Natural Philosopher on Fri Sep 15 08:30:24 2023

XPost: comp.os.linux.misc

The Natural Philosopher <tnp@invalid.invalid> writes:

On 14/09/2023 09:23, Richard Kettlewell wrote:

Also:
* I would also have a look at the kernel log; if it’s a
kernel-generated signal then there’s usually a log message about it.

Nothing in kern.log after the boot process finishes.

Most likely a bug in your program then.

* Run the application under valgrind; depending what the issue is, that
will provide a backtrace and perhaps more detailed information. If it
is a memory corruption issue then it may identify where the corruption
happens, rather than the later point where malloc failed a consistency
check (or whatever it is).

Using valgrind (and/or compiler sanitizer features) is a good idea
even before running into trouble, really.

The strange thing is that it failed once after a minute, then I
rebooted and it failed after 20 minutes, and its been running several
days now with no issues at all.

I am not sure valgrind would actually help unless it failed.

It’s extremely good at identifying memory corruption even in cases where
that doesn’t immediately lead to a crash; that’s what it’s for. But if it doesn’t, you leave it running until the crash happens.

Up to you, of course, whether you use the tools available, or debug with
one hand tied behind your back.
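
If leaving the daemon under valgrind for days is impractical on a Pi Zero, one lighter-weight option is to have the process print its own call stack when SIGABRT arrives. The sketch below assumes glibc (backtrace() and backtrace_symbols_fd() come from <execinfo.h> and are not standard C), and the names on_abort and install_abort_backtrace are made up for illustration. backtrace() is not guaranteed to be async-signal-safe, so treat this as a debugging aid rather than production code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd(): glibc-specific */
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Write the current call stack to stderr, then let the signal kill us. */
static void on_abort(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    raise(sig);   /* SA_RESETHAND has already restored the default action */
}

/* Install the handler for SIGABRT. Returns 0 on success, -1 on failure. */
static int install_abort_backtrace(void)
{
    struct sigaction sa;
    void *warm[1];

    /* Call backtrace() once up front so any lazy loading it does
     * happens here and not inside the signal handler. */
    (void)backtrace(warm, 1);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_abort;
    sa.sa_flags = SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGABRT, &sa, NULL);
}
```

Calling install_abort_backtrace() once at startup should be enough. Linking with -rdynamic makes the symbol names in the output more useful, and on 32-bit ARM the unwinder may need -funwind-tables to see more than one frame. When the daemon runs as a systemd service, stderr normally ends up in the journal.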

--
https://www.greenend.org.uk/rjk/

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Ralf Fassel@3:770/3 to All on Fri Sep 15 11:11:00 2023

XPost: comp.os.linux.misc

* The Natural Philosopher <tnp@invalid.invalid>
| On 14/09/2023 16:29, Ralf Fassel wrote:
| > * The Natural Philosopher <tnp@invalid.invalid>
| > | One possibility is that it is opening and reading a file at the
| > | precise time another process is writing it...in both cases the read
| > | and write
| > | operations are atomic and done with C code.
| >>
| > | READ
| > | ====
| > | fp=fopen(fullname, "r");
| > | len=fread(filbuf,1,255,fp); // read entire file
| > Check for fp != NULL is missing here in this example code before
| > fread(). If this also in the production version, it might be a problem
| > if the file is not accessible for any reason.
| > R'
| Ralf, I already put that in this morning, re compiled the code and
| after an hour, it crashed again.

| The filename is built by scanning a directory so the filename must exist.

That assumption does not hold. Since scanning and opening are separated
by a time gap (albeit a 'small' one), there is a non-zero chance that
the file vanished between scan and open.

Further possibilities:
- how is 'filbuf' used after the fread()? If you use it as C-string, make
sure it is 0-terminated (fread() won't do that for you). Maybe use
fgets(3) instead?
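
A minimal sketch of the termination point, assuming a hypothetical helper called read_small_file and a caller-supplied buffer: it reads at most one byte less than the buffer size so there is always room for the terminator, and it checks fopen() before touching the stream.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read at most bufsize-1 bytes of `path` into buf and NUL-terminate it.
 * Returns the number of bytes read, or -1 if the file could not be opened
 * or read. read_small_file is a hypothetical helper, not the daemon's code. */
static long read_small_file(const char *path, char *buf, size_t bufsize)
{
    assert(buf != NULL && bufsize > 0);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;              /* file vanished or is unreadable */

    size_t len = fread(buf, 1, bufsize - 1, fp);
    int failed = ferror(fp);
    fclose(fp);
    if (failed)
        return -1;

    buf[len] = '\0';            /* fread() does not terminate the buffer */
    return (long)len;
}
```

With a 256-byte buffer this reads the same 255 bytes as the posted fread() call, but the terminator can never land outside the buffer.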

| I am leaning towards possibly a cracked solder joint or board.

Well, since the Raspi is cheap, that should be easily checked by simply
using another one. I bet 1 beer that it is *not* a cracked board, since
with that many more processes should run into trouble, not only this
particular one.

R' (.sig not from me .-)
--
echo '[ bottles of beer]sa[ bottle of beer]sb[ take one down, pass it around ]sd[ on the wall]sc[no more]se99snlc[lalnpsnPplalnp1-snpldPln1=ylnpsnPp[]pst ln0<x]sx[salblnpsnPplblnpsnpldPleplaPlcpq]sylxx' | dc

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From The Natural Philosopher@3:770/3 to David W. Hodgins on Fri Sep 15 10:15:40 2023

XPost: comp.os.linux.misc

On 14/09/2023 20:57, David W. Hodgins wrote:

On Thu, 14 Sep 2023 14:57:36 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

On 14/09/2023 19:53, David W. Hodgins wrote:

journalctl -b --no-h|grep fsck

Sep 14 14:17:03 systemd[1]: Created slice system-systemd\x2dfsck.slice.
Sep 14 14:17:03 systemd[1]: Listening on fsck to fsckd communication Socket.
Sep 14 14:17:04 systemd-fsck[109]: e2fsck 1.46.2 (28-Feb-2021)
Sep 14 14:17:04 systemd-fsck[109]: rootfs: clean, 51075/932256 files, 460111/3822976 blocks
Sep 14 14:17:14 systemd-fsck[178]: fsck.fat 4.2 (2021-01-31)
Sep 14 14:17:14 systemd-fsck[178]: There are differences between boot sector and its backup.
Sep 14 14:17:14 systemd-fsck[178]: This is mostly harmless. Differences: (offset:original/backup)
Sep 14 14:17:14 systemd-fsck[178]: 65:01/00
Sep 14 14:17:14 systemd-fsck[178]: Not automatically fixing this.
Sep 14 14:17:14 systemd-fsck[178]: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
Sep 14 14:17:14 systemd-fsck[178]: Automatically removing dirty bit.
Sep 14 14:17:14 systemd-fsck[178]: *** Filesystem was changed ***
Sep 14 14:17:14 systemd-fsck[178]: Writing changes.
Sep 14 14:17:14 systemd-fsck[178]: /dev/mmcblk0p1: 330 files, 25815/130554 clusters
Sep 14 14:30:12 systemd[1]: systemd-fsckd.service: Succeeded.

If there are any corrupted files, diagnosing any problems they cause
will be difficult. I strongly recommend re-installing.

Regards, Dave Hodgins

If it persists I may do that, but now it has been rock steady for 20 hours.

The actual code has been replaced because I recompiled it anyway, but
the problem persisted after that.

Then I twisted the board a bit, and now it hasn't failed since. No
guarantees of course.

Does anyone else remember Tracy Kidder's 'Soul of a New Machine'* where
they had a wire wrapped backplane on the prototype and a strange
intermittent bug? And the director came in, twisted the backplane and
the bug instantly reappeared?

One of the more curious 'bugs' I encountered was early in my software
career, when code that I wrote suddenly went crazy, in a way that
the actual software as written could not possibly have caused. And only
on one machine, equipped with a custom video capture card. We removed
the card, but it made no difference.

I then compared the code on the machine with the code as compiled. Two
bytes were FFH.

I burned a new floppy and transferred the code again, and the code ran correctly.

Then we reinstalled the video card. The code ran correctly. Then we
copied over the code again with the video card installed. The code again
was corrupted.

Then the hardware guys looked at the address decode in the video card.
It was a mass of gates one after the other. The total delay was well out
of spec. It dawned on us that what was happening was that the DMA
controller on the floppy was using bus addresses that were being decoded
by the card, and then the IO request came along to access the floppy and
those addresses were still on the bus as far as the sluglike video card
was concerned, so it grabbed the data bus and shoved FFH on it.

Hardware is not perfect. That is the lesson. And chasing software when
it's really hardware is a losing game.

Anyway, I have in reserve all the great techniques suggested, but for
now I am playing a wait and see game to see if any pattern emerges. My
experience suggests that the same code running a loop in the same memory
won't crash and burn unless there is a malloc/free mismatch, and that
happens fairly quickly and shows in 'top'.

This kind of weird, utterly asynchronous behaviour is often hardware.
And since I trod on the bloody PCB, I may simply get another one and
test that. It doesn't need to be installed till winter. There is time.
And my PCB design for the relay and PSU module isn't back from China yet...

*https://en.wikipedia.org/wiki/The_Soul_of_a_New_Machine . Definitely recommended if you haven't read it.

--
"When one man dies it's a tragedy. When thousands die it's statistics."

Josef Stalin

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From The Natural Philosopher@3:770/3 to All on Fri Sep 15 10:16:34 2023

XPost: comp.os.linux.misc

On 14/09/2023 20:40, candycanearter07 wrote:

On 9/14/23 13:42, The Natural Philosopher wrote:

I assumed that the thing would have done its own fsck on every boot
anyway...isn't that a Debian default?

Pretty sure it's a standard, my arch install has it set.

(The sixth fields are 2 and 1 respectively for the file systems)

PARTUUID=b8c9fbb7-01 /boot vfat defaults 0 2
PARTUUID=b8c9fbb7-02 / ext4 defaults,noatime 0 1

1 is fsck check for the root partition and 2 is for others, right

I looked it up, it merely specifies the order I think, so you are right
in practice.

--
"Corbyn talks about equality, justice, opportunity, health care, peace, community, compassion, investment, security, housing...."
"What kind of person is not interested in those things?"

"Jeremy Corbyn?"

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From The Natural Philosopher@3:770/3 to Robert Riches on Fri Sep 15 10:23:40 2023

On 15/09/2023 01:40, Robert Riches wrote:

On 2023-09-14, The Natural Philosopher <tnp@invalid.invalid> wrote:

On 14/09/2023 16:29, Ralf Fassel wrote:

* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.

| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file

Check for fp != NULL is missing here in this example code before
fread(). If this also in the production version, it might be a problem
if the file is not accessible for any reason.

R'

Ralf, I already put that in this morning, re compiled the code and after
an hour, it crashed again.

The filename is built by scanning a directory so the filename must exist.

Maybe not applicable in this situation, but if something deleted
the file between the time of the scan and the time of the fopen
call, it might/would not exist.

Exactly. That is a possibility, which I have now covered. It made no difference.

In practice the write code that *replaces* the file is very simple. It is fopen("w") immediately followed by fwrite().

Without knowing the exact code involved with the fopen("w"), I can't say
if that actually deletes the file and creates a new one, or merely
truncates it to zero length, or indeed just opens it and trims the
length *after* the new data is written.
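
For reference, the C standard describes mode "w" as "truncate to zero length or create text file for writing", so the existing file is not deleted; it is emptied at the moment of the fopen() and then refilled. A reader that opens the file in between can therefore see an empty or partly written file. A common way to close that window is to write to a temporary name and rename() it over the real one, since POSIX rename() replaces the target atomically when both names are on the same filesystem. A minimal sketch, where write_file_atomically and the ".tmp" suffix are assumptions and not the daemon's actual code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Replace `path` with `data` so that a concurrent reader sees either the
 * old file or the complete new one. Returns 0 on success, -1 on failure.
 * The function name and the ".tmp" suffix are assumptions for this sketch. */
static int write_file_atomically(const char *path, const char *data)
{
    assert(path != NULL && data != NULL);

    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                      /* path too long for the buffer */

    FILE *fp = fopen(tmp, "w");
    if (fp == NULL)
        return -1;

    int ok = (fputs(data, fp) != EOF);
    if (fclose(fp) != 0)                /* fclose() flushes, so check it too */
        ok = 0;

    if (!ok || rename(tmp, path) != 0) {
        remove(tmp);                    /* do not leave a stale temp file */
        return -1;
    }
    return 0;
}
```

Because the temporary name ends in ".tmp" rather than ".dat", a directory scan that only accepts ".dat" files would skip it.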

--
WOKE is an acronym... Without Originality, Knowledge or Education.

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From The Natural Philosopher@3:770/3 to Theo on Fri Sep 15 10:27:16 2023

XPost: comp.os.linux.misc

On 14/09/2023 21:51, Theo wrote:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

both already done. Not closing it was the cause of a memory leak but I
fixed that a fortnight ago.

I am beginning to wonder if I did more damage than just the power socket
when I trod on it.

SIGABRT is a problem in your code.

Very definite.

Are you sure about that?

If you aren't seeing stuff in the kernel log then it almost certainly isn't a hardware fault. It is a very special skill to have a hardware fault without spewing lots of stuff there.

Even a corrupted bit in a ram disk?

Post the code somewhere and someone can take a look. Otherwise you need to use the development tools available to you to debug the problem.

I can post the code, but it may not help. You need the whole system
including the peripherals that write to the daemon that writes the data
files that the daemon that crashes reads.

At the moment it is behaving perfectly. Without a reproducible bug I can
see no point in using a debugger.

Theo

--
There is nothing a fleet of dispatchable nuclear power plants cannot do
that cannot be done worse and more expensively and with higher carbon
emissions and more adverse environmental impact by adding intermittent renewable energy.

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From The Natural Philosopher@3:770/3 to Richard Kettlewell on Fri Sep 15 10:46:44 2023

XPost: comp.os.linux.misc

On 15/09/2023 08:30, Richard Kettlewell wrote:

The Natural Philosopher <tnp@invalid.invalid> writes:

On 14/09/2023 09:23, Richard Kettlewell wrote:

Also:
* I would also have a look at the kernel log; if it’s a
kernel-generated signal then there’s usually a log message about it.

Nothing in kern.log after the boot process finishes.

Most likely a bug in your program then.

* Run the application under valgrind; depending what the issue is, that
will provide a backtrace and perhaps more detailed information. If it
is a memory corruption issue then it may identify where the corruption
happens, rather than the later point where malloc failed a consistency
check (or whatever it is).

Using valgrind (and/or compiler sanitizer features) is a good idea
even before running into trouble, really.

The strange thing is that it failed once after a minute, then I
rebooted and it failed after 20 minutes, and its been running several
days now with no issues at all.

I am not sure valgrind would actually help unless it failed.

It’s extremely good at identifying memory corruption even in cases where that doesn’t immediately lead to a crash; that’s what it’s for. But if it doesn’t, you leave it running until the crash happens.

Well that is an option for sure.

Up to you, of course, whether you use the tools available, or debug with
one hand tied behind your back.

Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?

The problem is that this thing is looping very frequently.
void loop(void)
{
    while (1)
    {
        int i;
        readThermometers();
        readZones();
        readOverrides();
        readTimerData();
        setRelayState();
        setRelays();
        usleep (1120000);
    }
}

And that means thousands of faultless iterations in a day.

So this bug ( if it is a bug) is a one in a million or worse.

I suppose I could make the thing loop ten times a second (or even
faster) and see if it happens more often..

It's not as though it's chewing up CPU...

The problem I have is that these crashes only recently started
happening: prior to that the code ran for days. And two things happened,
a massive brownout, and then a full power cut, and I trod on it.

And I made systemd start it...

I see it crashed again last night, again with zero errors apart from
SIGABRT...

I will start it manually and cut systemd out.

--
The lifetime of any political organisation is about three years before
its been subverted by the people it tried to warn you about.

Anon.

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From The Natural Philosopher@3:770/3 to Ralf Fassel on Fri Sep 15 11:19:10 2023

XPost: comp.os.linux.misc

On 15/09/2023 10:11, Ralf Fassel wrote:

* The Natural Philosopher <tnp@invalid.invalid>
| On 14/09/2023 16:29, Ralf Fassel wrote:
| > * The Natural Philosopher <tnp@invalid.invalid>
| > | One possibility is that it is opening and reading a file at the
| > | precise time another process is writing it...in both cases the read
| > | and write
| > | operations are atomic and done with C code.
| >>
| > | READ
| > | ====
| > | fp=fopen(fullname, "r");
| > | len=fread(filbuf,1,255,fp); // read entire file
| > Check for fp != NULL is missing here in this example code before
| > fread(). If this also in the production version, it might be a problem
| > if the file is not accessible for any reason.
| > R'
| Ralf, I already put that in this morning, re compiled the code and
| after an hour, it crashed again.

| The filename is built by scanning a directory so the filename must exist.

That assumption does not hold. Since scanning and opening are separated
by a time gap (albeit a 'small' one), there is a non-zero chance that
the file vanished between scan and open.

Further possibilities:
- how is 'filbuf' used after the fread()? If you use it as C-string, make
sure it is 0-terminated (fread() won't do that for you). Maybe use
fgets(3) instead?

dir = opendir(VOLATILE_DIR);

if(!dir)
    return;
while ((dp = readdir (dir)) != NULL)
{
    filename=dp->d_name;
    // skip known bollocks
    if(!strcmp(filename, "." ) || !strcmp(filename, ".." ) || !strcmp(filename, "relays.dat" ))
        continue;
    // construct full path
    sprintf(fullname,"%s/%s",VOLATILE_DIR,filename);
    stat(fullname,&stats); // get file times
    if(time(NULL)-stats.st_ctime >1800) // skip files older than half an hour
        continue;
    len=strlen(filename);
    if(strncmp(filename+len-4, ".dat",4)) // .dat file but not relays.dat
        continue;
    fp=fopen(fullname, "r");
    if(fp==0) // file has disappeared?
        continue;
    len=fread(filbuf,1,255,fp);
    if(len==0) // file has zero length
        goto baddata;
    filbuf[len]=0;
    if(len=strncmp(filbuf,"ZONE",4)) // supposed to reject a file whose contents do not start with ZONE
        goto baddata;

    // looking very much like a temperature file
    i=(int)filbuf[4] -'1'; // this is our zone from "ZONE2" etc. 1-4 is zone but index is 0-3 so subtract '1'
    p=strstr(filbuf,"\n");
    if(p)
    {
        p++;
        if(q=strstr(p,"\n"))
        {
            *q++=0;
            thermometers[i].name=strdup(p); // make a copy of the name and attach it to our thermometer structure
            p=q;
        }
        else goto baddata;
        // now to fetch the temp data.
        if(q=strstr(p,"\n"))
        {
            *q++=0;
            thermometers[i].temp=atof(p);
            p=q;
        }
        else goto baddata;
        // what's left is the voltage. To hell with any crap after it
        thermometers[i].battery=atof(p);
    }
baddata:fclose(fp);
} // end of directory scan loop

| I am leaning towards possibly a cracked solder joint or board.

Well, since the Raspi is cheap, that should be easily checked by simply
using another one. I bet 1 beer that it is *not* a cracked board, since
with that many more processes should run into trouble, not only this particular one.

R' (.sig not from me .-)

--
There is something fascinating about science. One gets such wholesale
returns of conjecture out of such a trifling investment of fact.

Mark Twain

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Theo@3:770/3 to The Natural Philosopher on Fri Sep 15 11:58:12 2023

XPost: comp.os.linux.misc

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?

Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's possible to bit-flip in RAM or storage without it noticing. I don't know
which part of the galaxy you inhabit, but cosmic rays are rare enough down
here that random bit flips like this don't happen often - ballpark once a
year for a server (which has a much greater surface area to absorb them than
a Pi).

It is also possible to be marginal on signal integrity for PCB interconnect, but that would mostly be a design fault: either they all work or many of
them don't. Since we don't have a lot of people complaining of the same problem, we can assume the design is not marginal in that respect.

If computers were that unreliable they would be failing all the time - and
we'd fit ECC to everything. That they aren't suggests bit-flip corruption isn't a problem. In general random bit-flip errors are not a statistically major source of crashes, unless you're running a hyper-redundant mainframe
and have eliminated all the other sources.

A well-known class of bugs, by contrast, is concurrency/timing races and memory safety violations, which is odds-on what's happening here, especially given we've already picked up on potentially risky code like failing to check for NULL from fopen().

And that means thousands of faultless iterations in a day.

So this bug ( if it is a bug) is a one in a million or worse.

I suppose I could make the thing loop ten times a second (or even
faster) and see if it happens more often..

That would be a useful thing to try.

its not as though its chewing up CPU...

The problem I have is that these crashes only recently started
happening: prior to that the code ran for days. And two things happened,
a massive brownout, and then a full power cut, and I trod on it.

Most of those things would cause it to fail hard (ie not power up), rather
than have a very rare random fault.

And I made systemd start it...

It is possible that being run from systemd changes the timing or environment that provokes the fault in some way, but I doubt it would be the cause of
the fault.

Theo

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Richard Kettlewell@3:770/3 to The Natural Philosopher on Fri Sep 15 11:58:08 2023

XPost: comp.os.linux.misc

The Natural Philosopher <tnp@invalid.invalid> writes:

On 15/09/2023 08:30, Richard Kettlewell wrote:

The Natural Philosopher <tnp@invalid.invalid> writes:

I am not sure valgrind would actually help unless it failed.

It’s extremely good at identifying memory corruption even in cases
where that doesn’t immediately lead to a crash; that’s what it’s for.
But if it doesn’t, you leave it running until the crash happens.

Well that is an option for sure.

Up to you, of course, whether you use the tools available, or debug with
one hand tied behind your back.

Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?

Very dependent on the nature of the corruption. But you’ve already told
us there’s nothing in the kernel logs.

Anyway, not responsible for advice not taken.

--
https://www.greenend.org.uk/rjk/

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Richard Kettlewell@3:770/3 to The Natural Philosopher on Fri Sep 15 12:07:48 2023

XPost: comp.os.linux.misc

The Natural Philosopher <tnp@invalid.invalid> writes:

dir = opendir(VOLATILE_DIR);

if(!dir)
return;
while ((dp = readdir (dir)) != NULL)
{
filename=dp->d_name;
// skip known bollocks
if(!strcmp(filename, "." ) || !strcmp(filename, ".." )
|| !strcmp(filename, "relays.dat" ))
continue;
// construct full path
sprintf(fullname,"%s/%s",VOLATILE_DIR,filename);

Possible write overrun here.

stat(fullname,&stats);// get tfile times
if(time(NULL)-stats.st_ctime >1800) // skip files older than half an hour
continue;
len=strlen(filename);
if(strncmp(filename+len-4, ".dat",4)) // .dat file but not relays.dat
continue;

Possible read under-run here. (But if it crashes then you’d expect
SIGSEGV rather than SIGABRT, so that’s probably not the issue.)

fp=fopen(fullname, "r");
if(fp==0) //file has disappeared?
continue;
len=fread(filbuf,1,255,fp);

I don’t think the declaration of filbuf has been posted, so there’s a possible write overrun if it’s less than 255 bytes.
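
The two bounds problems pointed out above can be avoided with a couple of small helpers. This is only a sketch: build_path and has_dat_suffix are hypothetical names, and the real sizes of fullname and filbuf have not been posted. Note also that the posted code stores a terminator at filbuf[len] with len up to 255, so filbuf would need to be at least 256 bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "dir/name" in out. Returns 0 on success, -1 if it would not fit,
 * so an over-long d_name is rejected instead of overrunning the buffer. */
static int build_path(char *out, size_t outsize, const char *dir, const char *name)
{
    assert(out != NULL && outsize > 0);

    int n = snprintf(out, outsize, "%s/%s", dir, name);
    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}

/* Return 1 if name ends in ".dat", 0 otherwise. The length check means
 * we never read before the start of a name shorter than four characters. */
static int has_dat_suffix(const char *name)
{
    size_t len = strlen(name);
    return len >= 4 && strcmp(name + len - 4, ".dat") == 0;
}
```

In the scan loop, a failing build_path() or has_dat_suffix() would simply mean "continue" to the next directory entry.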

--
https://www.greenend.org.uk/rjk/

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Ralf Fassel@3:770/3 to All on Fri Sep 15 13:12:26 2023

XPost: comp.os.linux.misc

You trust the contents of 'outside'-files very much, do you? ;-)
I don't know who can create files in the directory you're scanning, but
not *assuring* the input you expect is another possible cause for
problems...

* The Natural Philosopher <tnp@invalid.invalid>
| > Further possibilities:
| > - how is 'filbuf' used after the fread()? If you use it as C-string, make
| > sure it is 0-terminated (fread() won't do that for you). Maybe use
| > fgets(3) instead?
| >
| dir = opendir(VOLATILE_DIR);

| if(!dir)
| return;
| while ((dp = readdir (dir)) != NULL)
[looks good, error checks for stat() et al couldn't hurt]
--<snip-snip>--
| if(len=strncmp(filbuf,"ZONE",4)) //supposed to reject
| a file whose contents do not start with ZONE
| goto baddata;
|
| // looking very much like a temperature file
| i=(int)filbuf[4] -'1'; // this is our zone from
| "ZONE2" etc. 1-4 is zone but index is 0-3 so subtract
| '1'

The access of filbuf[4] is ok (since you checked that there are at least
4 characters in the file), but what if nothing follows after the 'ZONE',
or ZONE is followed by anything but [1-4]?

Assert that 'i' is in the valid index range here, before using it as
index into other arrays.
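
The range check could look something like the sketch below. It assumes there are four zones (NUM_ZONES is a made-up name, based on the "1-4 is zone" comment) and that the caller passes the length returned by fread(). A file containing just "ZONE", or one starting "ZONE9", would otherwise produce an index of -49 or 8 and a write outside the array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_ZONES 4   /* assumption: thermometers[] has four entries */

/* Return the 0-based zone index for a buffer starting "ZONE1".."ZONE4",
 * or -1 if the prefix or the digit is not what we expect. */
static int parse_zone_index(const char *buf, size_t len)
{
    assert(buf != NULL);

    if (len < 5 || strncmp(buf, "ZONE", 4) != 0)
        return -1;

    int i = buf[4] - '1';
    if (i < 0 || i >= NUM_ZONES)
        return -1;      /* not a digit in 1..4: reject the file */
    return i;
}
```

A negative return would be treated the same as any other bad file, i.e. jump to the cleanup that closes it.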

| p=strstr(filbuf,"\n");
| if(p)
| {
| p++;
| if(q=strstr(p,"\n"))
| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure

Memory leak if thermometers[i].name already contains something.
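
One way to avoid that leak, sketched with a hypothetical helper set_string: copy first, then free the old value and store the new one. It assumes the name pointers start out as NULL (for example because thermometers[] is a static array) and that nothing else frees them.

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Replace *slot with a copy of value, freeing whatever was there before.
 * Returns 0 on success, or -1 if the copy could not be allocated, in
 * which case the old string is left in place. */
static int set_string(char **slot, const char *value)
{
    assert(slot != NULL && value != NULL);

    char *copy = strdup(value);
    if (copy == NULL)
        return -1;

    free(*slot);        /* free(NULL) is a no-op, so an empty slot is fine */
    *slot = copy;
    return 0;
}
```

In the posted loop this would be called as set_string(&thermometers[i].name, p) in place of the bare strdup() assignment.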

Other than that, I really would have it running under a debugger or
valgrind, since then *if* it crashes, you *know* *where* in your code it crashes.

Good luck hunting!
R'

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Richard Kettlewell@3:770/3 to Theo on Fri Sep 15 12:12:58 2023

XPost: comp.os.linux.misc

Theo <theom+news@chiark.greenend.org.uk> writes:

The Natural Philosopher <tnp@invalid.invalid> wrote:

Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?

Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's possible to bit-flip in RAM or storage without it noticing. I don't know which part of the galaxy you inhabit, but cosmic rays are rare enough down here that random bit flips like this don't happen often - ballpark once a year for a server (which has a much greater surface area to absorb them than a Pi).

I’ve seen one inarguable random bit flip in several decades. In that
case the behavior was deterministic - chiark’s /bin/ls had got a
single-bit error, and caching meant it crashed _every_ time anyone ran
it.

Maybe TNP has taken a trip to Sizewell?

--
https://www.greenend.org.uk/rjk/

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From Pancho@3:770/3 to The Natural Philosopher on Fri Sep 15 12:19:44 2023

XPost: comp.os.linux.misc

On 15/09/2023 10:46, The Natural Philosopher wrote:

On 15/09/2023 08:30, Richard Kettlewell wrote:

The Natural Philosopher <tnp@invalid.invalid> writes:

On 14/09/2023 09:23, Richard Kettlewell wrote:

Also:
* I would also have a look at the kernel log; if it’s a
kernel-generated signal then there’s usually a log message about it.

Nothing in kern.log after the boot process finishes.

Most likely a bug in your program then.

* Run the application under valgrind; depending what the issue is, that
will provide a backtrace and perhaps more detailed information. If it
is a memory corruption issue then it may identify where the corruption
happens, rather than the later point where malloc failed a consistency
check (or whatever it is).

Using valgrind (and/or compiler sanitizer features) is a good idea
even before running into trouble, really.

The strange thing is that it failed once after a minute, then I
rebooted and it failed after 20 minutes, and its been running several
days now with no issues at all.

I am not sure valgrind would actually help unless it failed.

It’s extremely good at identifying memory corruption even in cases where
that doesn’t immediately lead to a crash; that’s what it’s for. But if
it doesn’t, you leave it running until the crash happens.

Well that is an option for sure.

Valgrind seems to be a modern version of Purify, which was absolutely essential, when I programmed C 30 years ago.

Personally, I want to run with full debug, stack trace, logging,
exception handling, and bounds checking turned on all the time, even in production. Which is why I generally use a modern language like C# or Java.

I'm with you on Python being rubbish, but have you considered something
like Rust? That gives you the benefit of a modern language, without
Garbage Collection pauses (if you care), or the need for a runtime
environment (like Python, C#, and Java).

Even using C++ would give you exception handling. C++ won't force you
to go too far if you don't want to.

--- SoupGate-Win32 v1.05
* Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)

From The Natural Philosopher@3:770/3 to Theo on Fri Sep 15 13:03:52 2023

XPost: comp.os.linux.misc

On 15/09/2023 11:58, Theo wrote:

Well-known classes of bugs are concurrency/timing races and memory safety violations, which is odds-on what's happening here, especially given we've already picked up on potentially risky code like failing to check for NULL from fopen().

No, I do check it.

--
“It is dangerous to be right in matters on which the established
authorities are wrong.”

― Voltaire, The Age of Louis XIV


From Pancho@3:770/3 to Theo on Fri Sep 15 12:24:32 2023

XPost: comp.os.linux.misc

On 15/09/2023 11:58, Theo wrote:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?

Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's possible to bit-flip in RAM or storage without it noticing. I don't know which part of the galaxy you inhabit, but cosmic rays are rare enough down here that random bit flips like this don't happen often - ballpark once a year for a server (which has a much greater surface area to absorb them than a Pi).

Lol! I thought cosmic rays when I read this thread.

Decades of having my nose rubbed in the shit of my own stupidity, I
guess. :-)


From The Natural Philosopher@3:770/3 to Richard Kettlewell on Fri Sep 15 13:18:54 2023

XPost: comp.os.linux.misc

On 15/09/2023 12:07, Richard Kettlewell wrote:

The Natural Philosopher <tnp@invalid.invalid> writes:

dir = opendir(VOLATILE_DIR);

if(!dir)
return;
while ((dp = readdir (dir)) != NULL)
{
filename=dp->d_name;
// skip known bollocks
if(!strcmp(filename, "." ) || !strcmp(filename, ".." )
|| !strcmp(filename, "relays.dat" ))
continue;
// construct full path
sprintf(fullname,"%s/%s",VOLATILE_DIR,filename);

Possible write overrun here.

The filenames never change length.

stat(fullname,&stats);// get tfile times
if(time(NULL)-stats.st_ctime >1800) // skip files older than half an hour
continue;
len=strlen(filename);
if(strncmp(filename+len-4, ".dat",4)) // .dat file but not relays.dat
continue;

Possible read under-run here. (But if it crashes then you’d expect
SIGSEGV rather than SIGABRT, so that’s probably not the issue.)

fp=fopen(fullname, "r");
if(fp==0) //file has disappeared?
continue;
len=fread(filbuf,1,255,fp);

I don’t think the declaration of filbuf has been posted, so there’s a possible write overrun if it’s less than 255 bytes.

char filbuf[256];
char fullname[256];

The fullname is of the form

/var/www/data/volatile/192.168.0.xx.dat

There are no other files apart from 'relay.dat' in that directory.
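
The three review points above (the sprintf() overrun, the read before the start of a short filename, and the unchecked stat()) can each be closed off cheaply. The following is a minimal sketch only; the helper names build_path, has_suffix and is_fresh are hypothetical and are not taken from the posted daemon.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: returns 1 if name ends in suffix, else 0.
 * It never reads before the start of name, even for short names. */
static int has_suffix(const char *name, const char *suffix)
{
    size_t nlen = strlen(name);
    size_t slen = strlen(suffix);
    if (nlen < slen)
        return 0;
    return strcmp(name + nlen - slen, suffix) == 0;
}

/* Hypothetical helper: builds "dir/name" into out.
 * Returns 0 on success, -1 if the result would not fit in outsz bytes. */
static int build_path(char *out, size_t outsz, const char *dir, const char *name)
{
    int n = snprintf(out, outsz, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return 0;
}

/* Hypothetical helper: returns 1 if path exists and its ctime is within
 * max_age seconds of now, else 0 (including when stat() fails). */
static int is_fresh(const char *path, time_t max_age)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;               /* file vanished or is unreadable */
    return time(NULL) - st.st_ctime <= max_age;
}
```

With the posted 256-byte fullname buffer and the fixed filename format these checks would never fire in normal operation; they only matter when something unexpected lands in the directory.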

I mean you are all throwing noob bugs at me. Yes, in 1984 that's the
sort of shit I used to write. Not these days.

I have a drawer full of T shirts marked 'buffer overrun' 'alloc without
free' 'fopen without fclose'.

The fact is the memory footprint does not increase. So there are no
obvious or simple memory leaks.

I've absolutely covered every error case mentioned here in the one case
of the files that get written and read every few seconds.

It occurs to me that this behaviour started when I made it autoboot
under systemd as well.

Since the consensus seems to be it isn't hardware, or file corruption, I
am trying it launched manually to see if it crashes or not.

Systemd does seem to wrap things in resource limits, and start with a
slightly different ENV although I cant see that any are being exceeded.

If it wasn't a daemon I would expect it to segfault and show that on
screen. I could run it without daemonising it as well.

So lots of options to try.

As well as soft debuggers.

--
“It is dangerous to be right in matters on which the established
authorities are wrong.”

― Voltaire, The Age of Louis XIV


From The Natural Philosopher@3:770/3 to Richard Kettlewell on Fri Sep 15 13:06:04 2023

XPost: comp.os.linux.misc

On 15/09/2023 12:12, Richard Kettlewell wrote:

Theo <theom+news@chiark.greenend.org.uk> writes:

The Natural Philosopher <tnp@invalid.invalid> wrote:

Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?

Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's
possible to bit-flip in RAM or storage without it noticing. I don't know
which part of the galaxy you inhabit, but cosmic rays are rare enough down
here that random bit flips like this don't happen often - ballpark once a
year for a server (which has a much greater surface area to absorb them than
a Pi).

I’ve seen one inarguable random bit flip in several decades. In that
case the behavior was deterministic - chiark’s /bin/ls had got a
single-bit error, and caching meant it crashed _every_ time anyone ran
it.

Maybe TNP has taken a trip to Sizewell?

LOL!

Nope.

I am trying some stuff out to try and get it to fail *consistently*.

I don't feel it's hugely profitable to attempt to debug it when most of
the time it's not doing anything wrong.

--
“It is dangerous to be right in matters on which the established
authorities are wrong.”

― Voltaire, The Age of Louis XIV


From candycanearter07@3:770/3 to The Natural Philosopher on Fri Sep 15 07:53:26 2023

XPost: comp.os.linux.misc

On 9/15/23 04:16, The Natural Philosopher wrote:

On 14/09/2023 20:40, candycanearter07 wrote:

On 9/14/23 13:42, The Natural Philosopher wrote:

I assumed that the thing would have done its own fsck on every boot
anyway...isnt that a debian default?

Pretty sure it's a standard, my arch install has it set.

(The sixth fields are 2 and 1 respectively for the file systems)

PARTUUID=b8c9fbb7-01 /boot           vfat    defaults          0       2
PARTUUID=b8c9fbb7-02 /               ext4    defaults,noatime  0       1

1 is fsck check for the root partition and 2 is for others, right

I looked it up, it merely specifies the order I think, so you are right
in practice.

Oh, the thing I learned was that you should always put root as 1 and
everything else as 2 ^^" but that makes more sense

--
--
user <candycane> is generated from /dev/urandom


From The Natural Philosopher@3:770/3 to Ralf Fassel on Fri Sep 15 13:24:16 2023

XPost: comp.os.linux.misc

On 15/09/2023 12:12, Ralf Fassel wrote:

You trust the contents of 'outside'-files very much, do you? ;-)
I don't know who can create files in the directory you're scanning, but
not *assuring* the input you expect is another possible cause for
problems...

* The Natural Philosopher <tnp@invalid.invalid>
| > Further possibilities:
| > - how is 'filbuf' used after the fread()? If you use it as C-string, make
| > sure it is 0-terminated (fread() won't do that for you). Maybe use
| > fgets(3) instead?
| >
| dir = opendir(VOLATILE_DIR);

| if(!dir)
| return;
| while ((dp = readdir (dir)) != NULL)
[looks good, error checks for stat() et al couldn't hurt]
--<snip-snip>--
| if(len=strncmp(filbuf,"ZONE",4)) //supposed to reject
| a file whose contents do not start with ZONE
| goto baddata;
|
| // looking very much like a temperature file
| i=(int)filbuf[4] -'1'; // this is our zone from
| "ZONE2" etc. 1-4 is zone but index is 0-3 so subtract
| '1'

The access of filbuf[4] is ok (since you checked that there are at least
4 characters in the file), but what if nothing follows after the 'ZONE',
or ZONE is followed by anything but [1-4]?

That cannot happen. It's hard-wired into the code that writes the file.

Assert that 'i' is in the valid index range here, before using it as
index into other arrays.
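
The two suggestions in this post (terminate the buffer after fread(), and check the zone index before using it) could look roughly like this. It is a sketch under assumptions: NUMBER_RELAYS is taken to be 4 from the "1-4 is zone" comment, and read_text and parse_zone are hypothetical names, not functions from the daemon.

```c
#include <stdio.h>
#include <string.h>

#define NUMBER_RELAYS 4   /* assumed: the thread mentions zones 1-4 */

/* Hypothetical helper: reads at most bufsz-1 bytes and always
 * NUL-terminates, so the result is safe to pass to strstr() and friends.
 * bufsz must be at least 1. Returns the number of bytes read. */
static size_t read_text(FILE *fp, char *buf, size_t bufsz)
{
    size_t len = fread(buf, 1, bufsz - 1, fp);
    buf[len] = '\0';
    return len;
}

/* Hypothetical helper: returns the zero-based index for a buffer starting
 * "ZONE1".."ZONE4", or -1 for anything else, including an empty buffer.
 * buf must be NUL-terminated. */
static int parse_zone(const char *buf)
{
    if (strncmp(buf, "ZONE", 4) != 0)
        return -1;
    int i = buf[4] - '1';
    if (i < 0 || i >= NUMBER_RELAYS)
        return -1;
    return i;
}
```

A caller would skip the file whenever parse_zone() returns -1, which also covers a file that is empty or only partly written at the moment it is read.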

| p=strstr(filbuf,"\n");
| if(p)
| {
| p++;
| if(q=strstr(p,"\n"))
| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure

Memory leak if thermometers[i].name already contains something.

further up the line...

bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);

Other than that, I really would have it running under a debugger or
valgrind, since then *if* it crashes, you *know* *where* in your code it crashes.

Last resort. I have to learn how to *use* those tools.
Right now I am working on other stuff and am content to change one thing
at a time to see if that makes any difference.

That is a low user time strategy.

Good luck hunting!
R'

Thank you. The input has been valuable. And I now have further
strategies in reserve.

As with all intermittent faults, the thing you need most is a reliable
way to make the fault occur.

--
"The great thing about Glasgow is that if there's a nuclear attack it'll
look exactly the same afterwards."

Billy Connolly


From The Natural Philosopher@3:770/3 to Theo on Fri Sep 15 14:56:22 2023

XPost: comp.os.linux.misc

On 15/09/2023 14:23, Theo wrote:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

On 15/09/2023 12:12, Ralf Fassel wrote:

| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure

Memory leak if thermometers[i].name already contains something.

further up the line...

bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there was some code path that was breaking out of the loop or similar you might
get such behaviour.

Hmm. I free the pointers even for relay zones that don't have
thermometers, whose pointers are 0. That isn't an issue.

But that might be a remotely possible issue. I dont zero the pointers
after freeing them as far as I can tell. The silly thing is that this
program doesn't use the name anyway.

It's used elsewhere.
Well, I don't think it's an issue, but I can zero the pointers anyway
after free()ing.

Theo

--
"I guess a rattlesnake ain't risponsible fer bein' a rattlesnake, but ah
puts mah heel on um jess the same if'n I catches him around mah chillun".


From Theo@3:770/3 to The Natural Philosopher on Fri Sep 15 14:23:48 2023

XPost: comp.os.linux.misc

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

On 15/09/2023 12:12, Ralf Fassel wrote:

| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure

Memory leak if thermometers[i].name already contains something.

further up the line...

bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there
was some code path that was breaking out of the loop or similar you might
get such behaviour.

Theo


From Ralf Fassel@3:770/3 to All on Fri Sep 15 16:12:56 2023

XPost: comp.os.linux.misc

* The Natural Philosopher <tnp@invalid.invalid>
| > | if(len=strncmp(filbuf,"ZONE",4)) //supposed to reject
| > | a file whose contents do not start with ZONE
| > | goto baddata;
| > |
| > | // looking very much like a temperature file
| > | i=(int)filbuf[4] -'1'; // this is our zone from
| > | "ZONE2" etc. 1-4 is zone but index is 0-3 so subtract
| > | '1'
| > The access of filbuf[4] is ok (since you checked that there are at
| > least
| > 4 characters in the file), but what if nothing follows after the 'ZONE',
| > or ZONE is followed by anything but [1-4]?

| That cannot happen. It's hard-wired into the code that writes the file.

Depending on the permissions of VOLATILE_DIR, it *might* be possible
that *anybody* can drop files in there. Save some "// skip known
bollocks", you just scan every file in VOLATILE_DIR. If I were an
attacker, I sure would try to use that vector, regardless whether the
program in question runs with elevated permissions or not ;-)

| > Other than that, I really would have it running under a debugger or
| > valgrind, since then *if* it crashes, you *know* *where* in your code it
| > crashes.
| >
| Last resort. I have to learn how to *use* those tools.

With valgrind, it is as easy as putting 'valgrind' in front of the
commandline you use to start your program. With gdb, it is a tiny bit
more complicated, agreed. But since these tools are worth learning
anyway for any programmer, the time invested in learning them is not
wasted.

R'


From Ralf Fassel@3:770/3 to All on Fri Sep 15 16:27:46 2023

XPost: comp.os.linux.misc

* The Natural Philosopher <tnp@invalid.invalid>
| > | thermometers[i].name=strdup(p); //
| > | make a copy of the name and attach it
| > | to our thermometer structure
| > Memory leak if thermometers[i].name already contains something.
| >
| further up the line...

| bzero(filbuf,sizeof(filbuf));
| /** first thing to do is clean any allocated memory used to
| store values. **/
| for(i=0;i<NUMBER_RELAYS;i++)
| free(thermometers[i].name);

Note that the assignment

thermometers[i].name=strdup(p);

is *inside* the while() loop without a free().

Probably you argue that there ever is only a single file to read in that
dir anyway... Personally, I've been bitten by such assumptions, so I'd
rather check once too often than hunting down "can't happen" bugs.

R'


From The Natural Philosopher@3:770/3 to Theo on Fri Sep 15 15:32:44 2023

XPost: comp.os.linux.misc

On 15/09/2023 14:23, Theo wrote:

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there was some code path that was breaking out of the loop or similar you might
get such behaviour.

Well, I am not sure if that was it or not, but I deleted manually a
thermometer file and the thing crashed instantly. That is consistent
with the name having been set once, and then repeatedly free()ed. I then installed the code with the free()ed pointers set to NULL, and it
*didn't* crash instantly.

I had assumed that freeing a pointer that already had been freed would
either result in a NO-OP because the pointer no longer existed in the
heap memory allocation tables, or it would instantly crash , but it
seems that the action is 'undefined'.

Not sure that's done the trick, because I don't quite see how a file
could ever cease to exist.

To not exist in the first place is one thing, but once written, nothing
should delete them.

Unless fopen("w") does that for a fraction of a microsecond

Or fopen("w") creates an *empty* file, in which case it is *just*
possible that an empty file is read, no strdup was done and the pointer
was double freed...next time around.

Academic now anyway. Pointers all set to null after freeing. Defined
behaviour. frees on NULL ignored.

I'll let it run and run and see.

--
The biggest threat to humanity comes from socialism, which has utterly
diverted our attention away from what really matters to our existential survival, to indulging in navel gazing and faux moral investigations
into what the world ought to be, whilst we fail utterly to deal with
what it actually is.


From candycanearter07@3:770/3 to Theo on Fri Sep 15 09:40:00 2023

XPost: comp.os.linux.misc

On 9/15/23 08:23, Theo wrote:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:

On 15/09/2023 12:12, Ralf Fassel wrote:

| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure

Memory leak if thermometers[i].name already contains something.

further up the line...

bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there was some code path that was breaking out of the loop or similar you might
get such behaviour.

Theo

I thought double free was a SIGSEGV?
--
--
user <candycane> is generated from /dev/urandom


From The Natural Philosopher@3:770/3 to Ralf Fassel on Fri Sep 15 15:55:16 2023

XPost: comp.os.linux.misc

On 15/09/2023 15:27, Ralf Fassel wrote:

* The Natural Philosopher <tnp@invalid.invalid>
| > | thermometers[i].name=strdup(p); //
| > | make a copy of the name and attach it
| > | to our thermometer structure
| > Memory leak if thermometers[i].name already contains something.
| >
| further up the line...

| bzero(filbuf,sizeof(filbuf));
| /** first thing to do is clean any allocated memory used to
| store values. **/
| for(i=0;i<NUMBER_RELAYS;i++)
| free(thermometers[i].name);

Note that the assignment

thermometers[i].name=strdup(p);

is *inside* the while() loop without a free().

Probably you argue that there ever is only a single file to read in that
dir anyway... Personally, I've been bitten by such assumptions, so I'd rather check once too often than hunting down "can't happen" bugs.

R'

No. you have misunderstood how the code works.

There are up to 4 (NUMBER_RELAYS) thermometer files in that dir, and all
of them are read in the loop. What there shouldn't be is more than one
file with a ZONE number the same. So no pointer gets more than one STRDUP

If there were, it might be possible to strdup the same pointer twice.
And the daemon would get a memory leak and crash.

(It would be trivial to simply add a conditional that only strdups to a
pointer if it is NULL).

That is a possibility that could be caused by mis-configuration of the thermometers themselves.

However they are not at this time misconfigured, so it shouldn't be the
crash problem, although it is an issue I will consider because fat
fingers *could* cause it.

I do think that what has happened is that a valid file name has been
found with empty data, or no file at all, and then no strdup is done -
but the free is, next time around.

That should never happen of course, as the fopen/fwrite sequence should
certainly not delete the filename, but it is entirely possible that
the fopen *truncates* its data. At which point we can't strdup anything,
so the next free gets a woopsie.

Setting the pointers to NULL after free() is nice defensive coding

As is allocating memory only if the pointers are null.

So both are in there now.

--
“Progress is precisely that which rules and regulations did not foresee,”

– Ludwig von Mises


From Rich@3:770/3 to The Natural Philosopher on Fri Sep 15 15:00:24 2023

XPost: comp.os.linux.misc

In comp.os.linux.misc The Natural Philosopher <tnp@invalid.invalid> wrote:

I had assumed that freeing a pointer that already had been freed would
either result in a NO-OP because the pointer no longer existed in the
heap memory allocation tables, or it would instantly crash , but it
seems that the action is 'undefined'.

Yes, C explicitly labels "double free" as "undefined":

<http://port70.net/~nsz/c/c99/n1256.html#J.2>

Look under J.2 Undefined behavior (easiest is to search for "free"):

J.2 Undefined behavior

1 The behavior is undefined in the following circumstances:

...

The pointer argument to the free or realloc function does not match
a pointer earlier returned by calloc, malloc, or realloc, or the
space has been deallocated by a call to free or realloc (7.20.3.2,
7.20.3.4).

And the 7.20.3.2 link in the page jumps to this:

The free function causes the space pointed to by ptr to be
deallocated, that is, made available for further allocation. If
ptr is a null pointer, no action occurs. Otherwise, if the
argument does not match a pointer earlier returned by the calloc,
malloc, or realloc function, or if the space has been deallocated
by a call to free or realloc, the behavior is undefined.

So if by chance you are double-freeing sometimes, then you are tickling
the undefined behaviour devil, and all bets are off as to what might
eventually occur.


From The Natural Philosopher@3:770/3 to All on Fri Sep 15 16:06:16 2023

XPost: comp.os.linux.misc

On 15/09/2023 15:40, candycanearter07 wrote:

On 9/15/23 08:23, Theo wrote:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid>
wrote:

On 15/09/2023 12:12, Ralf Fassel wrote:

|                             {
|                             *q++=0;
|                             thermometers[i].name=strdup(p); //
|                             make a copy of the name and attach it
|                             to our thermometer structure

Memory leak if thermometers[i].name already contains something.

further up the line...

         bzero(filbuf,sizeof(filbuf));
         /** first thing to do is clean any allocated memory used to store values. **/
         for(i=0;i<NUMBER_RELAYS;i++)
                 free(thermometers[i].name);

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i you
call strdup() exactly once, and subsequently free() exactly once? If there
was some code path that was breaking out of the loop or similar you might
get such behaviour.

Theo

I thought double free was a SIGSEGV?

In fact it seems fairly undefined

It looks like it is somewhat implementation dependent. SIGSEGV means you accessed unallocated memory, but that is not the same as freeing
allocated memory, twice.

There seem to be instances of it reported. Google is a friend here.

I *suspect* that if that is the problem, it's a signal from deep within libc, whereas SIGSEGV probably emanates from a memory management unit somewhere.

--
"Strange as it seems, no amount of learning can cure stupidity, and
higher education positively fortifies it."

- Stephen Vizinczey


From Richard Kettlewell@3:770/3 to no@thanks.net on Fri Sep 15 16:09:18 2023

XPost: comp.os.linux.misc

candycanearter07 <no@thanks.net> writes:

On 9/15/23 08:23, Theo wrote:

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each
i you call strdup() exactly once, and subsequently free() exactly
once? If there was some code path that was breaking out of the loop
or similar you might get such behaviour.

I thought double free was a SIGSEGV?

If Glibc detects it you’ll get a diagnostic and SIGABRT.

If it doesn’t detect it then anything could happen - SIGSEGV is just one possibility.

--
https://www.greenend.org.uk/rjk/


From vallor@3:770/3 to tnp@invalid.invalid on Fri Sep 15 15:12:24 2023

XPost: comp.os.linux.misc

On Fri, 15 Sep 2023 14:56:23 +0100, The Natural Philosopher <tnp@invalid.invalid> wrote in <ue1nq7$39033$1@dont-email.me>:

On 15/09/2023 14:23, Theo wrote:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid>
wrote:

On 15/09/2023 12:12, Ralf Fassel wrote:

| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure

Memory leak if thermometers[i].name already contains something.

further up the line...

bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i
you call strdup() exactly once, and subsequently free() exactly once?
If there was some code path that was breaking out of the loop or
similar you might get such behaviour.

Hmm. I free the pointers even for relay zones that don't have
thermometers, whose pointers are 0. That isn't an issue.

But that might be a remotely possible issue. I dont zero the pointers
after freeing them as far as I can tell. The silly thing is that this
program doesn't use the name anyway.

It's used elsewhere. Well, I don't think it's an issue, but I can zero the pointers anyway after free()ing.

Theo

Hi, read the thread with interest.

If you're getting SIGABRT, that's almost always the software
calling abort(3). If you aren't, maybe there's a library calling it?

$ man 7 signal
[...]
Signal Standard Action Comment
SIGABRT P1990 Core Abort signal from abort(3)
[but it also lists]
SIGIOT - Core IOT trap. A synonym for SIGABRT
_ _ _ _ _ _ _
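
The link between abort(3) and the "status=6/ABRT" line in the systemd log can be reproduced in isolation. The sketch below is only an illustration (signal_from_abort is a hypothetical name): it runs abort() in a child process and reports which signal terminated it. On Linux that is SIGABRT, signal number 6.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks, calls abort() in the child, and returns the
 * number of the signal that killed the child, or -1 on any failure. */
static int signal_from_abort(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);  /* avoid leaving a core file */
        abort();                           /* raises SIGABRT in the child */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

This is the same exit path a C library takes when its allocator notices a corrupted heap and calls abort() itself, which is why the daemon's own source need not contain any abort() call.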

Meanwhile, if you want to avoid locking your file, you might want to write
a fresh file with a unique name, then rename() it,
which -- please correct me if I'm wrong -- should replace
the desired file atomically.

--
-v


From Martin Gregorie@3:770/3 to Pancho on Fri Sep 15 15:17:14 2023

XPost: comp.os.linux.misc

On Fri, 15 Sep 2023 12:19:45 +0100, Pancho wrote:

Personally, I want to run with full debug, stack trace, logging,
exception handling, and bounds checking turned on all the time, even in production. Which is why I generally use a modern language like C# or
Java.

Same here. Many years back I wrote the type of debugging and programming
support library I personally find most useful: it can report the content
of all common variable types, dump byte arrays as both hex and ASCII,
parse the command line, and allow the amount of debug info to be
controlled by a command line argument.

They are structured as small libraries that are designed to be lightweight
enough to be left in a program when it's in general use.

The library was originally written in C, but I soon wrote a Java version
as well, though this hasn't been separately published yet.

If this sounds useful, both versions can be found on
www.libelle-systems.com in the "Free Stuff" section.

--

Martin | martin at
Gregorie | gregorie dot org


From Rich@3:770/3 to no@thanks.net on Fri Sep 15 15:02:26 2023

XPost: comp.os.linux.misc

In comp.os.linux.misc candycanearter07 <no@thanks.net> wrote:

I thought double free was a SIGSEGV?

Check my other reply to TNP for the details, but it is "undefined" in
C.


From The Natural Philosopher@3:770/3 to Richard Kettlewell on Fri Sep 15 16:37:56 2023

XPost: comp.os.linux.misc

On 15/09/2023 16:09, Richard Kettlewell wrote:

candycanearter07 <no@thanks.net> writes:

On 9/15/23 08:23, Theo wrote:

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each
i you call strdup() exactly once, and subsequently free() exactly
once? If there was some code path that was breaking out of the loop
or similar you might get such behaviour.

I thought double free was a SIGSEGV?

If Glibc detects it you’ll get a diagnostic and SIGABRT.

I think that is conclusive.

It seems to have been a double free caused by lack of defensive coding
plus an asynch timed file write function causing the temporary creation
of an empty file, or perhaps no file at all.

If it doesn’t detect it then anything could happen - SIGSEGV is just one possibility.

--
I would rather have questions that cannot be answered...
...than to have answers that cannot be questioned

Richard Feynman


From Rich@3:770/3 to The Natural Philosopher on Fri Sep 15 15:26:10 2023

XPost: comp.os.linux.misc

In comp.os.linux.misc The Natural Philosopher <tnp@invalid.invalid> wrote:

On 15/09/2023 15:27, Ralf Fassel wrote:

Note that the assignment

thermometers[i].name=strdup(p);

is *inside* the while() loop without a free().

Probably you argue that there ever is only a single file to read in
that dir anyway... Personally, I've been bitten by such
assumptions, so I'd rather check once too often than hunting down
"can't happen" bugs.

I do think that what has happened is that a valid file name has been
found with empty data, or no file at all, and then no strdup is done
- but the free is, next time around.

That should never happen of course, as the fopen/fwrite sequence
should certainly not delete the filename, but it is entirely possible
that the fopen *truncates* its data. At which point we can't strdup
anything, so the next free gets a woopsie.

Are the "files" being written to by an independent process separate
from this reading process?

If yes, are you doing any form of locking/synchronization to prevent
the reading process from trying to read from a file that a writing
process has open/truncated, but not yet written any data into?

If no, then you may be also hitting a race condition where the stars
align just right, the writer has just performed its fopen/truncate
(leaving the file empty) and the kernel decides to context switch away
to the reader at that point, before the writer can write and close the
file. The reader will then see an empty file.
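
The window described above can be shown inside a single process by opening the file for reading at the moment a real reader might be scheduled in. This is a sketch for illustration only; bytes_seen_mid_rewrite is a hypothetical name, and the record text is made up to resemble the "ZONEn" files in the thread.

```c
#include <stdio.h>

/* Hypothetical helper: writes a full record, reopens the file with "w"
 * (which truncates it), and then reads it back before the second write
 * has happened. Returns the number of bytes the reader saw, or -1 on an
 * I/O failure. */
static long bytes_seen_mid_rewrite(const char *path)
{
    FILE *w = fopen(path, "w");
    if (!w)
        return -1;
    fputs("ZONE1\nkitchen\n", w);
    fclose(w);                       /* file now holds a complete record */

    w = fopen(path, "w");            /* truncates to zero length at once */
    if (!w)
        return -1;

    char buf[256];
    FILE *r = fopen(path, "r");      /* the "reader" arrives right now */
    if (!r) {
        fclose(w);
        return -1;
    }
    size_t seen = fread(buf, 1, sizeof buf - 1, r);
    fclose(r);

    fputs("ZONE1\nkitchen\n", w);    /* the writer finishes too late */
    fclose(w);
    return (long)seen;
}
```

The function returns 0: the reader finds a file that exists but is empty, even though a complete record was on disk a moment earlier.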

The classic "lock free" solution to this one is for the writer to
create and write to a temporary file, and after closing the temp file
to rename() it to the name of the real file. Rename is documented to
be atomic, so the reader would never see a half open, or partially
complete, file in this case.
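
A minimal sketch of that write-then-rename pattern follows. The function name write_atomically and the ".tmp" suffix are assumptions; the fixed temporary name also assumes a single writer per file, otherwise a unique name would be needed.

```c
#include <stdio.h>

/* Hypothetical helper: writes text to "<path>.tmp", then renames it over
 * path. Returns 0 on success, -1 on failure (the old file is left as is). */
static int write_atomically(const char *path, const char *text)
{
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *fp = fopen(tmp, "w");
    if (!fp)
        return -1;
    int ok = fputs(text, fp) != EOF;
    if (fclose(fp) != 0)              /* fclose flushes, so check it too */
        ok = 0;
    if (!ok || rename(tmp, path) != 0) {
        remove(tmp);                  /* do not leave a stray temp file */
        return -1;
    }
    return 0;
}
```

The temporary file has to live on the same filesystem as the target for rename() to work this way, which is why it is created alongside it. A reader that only accepts names ending in ".dat" will skip the ".tmp" file while it is being written.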


From The Natural Philosopher@3:770/3 to Rich on Fri Sep 15 16:44:54 2023

XPost: comp.os.linux.misc

On 15/09/2023 16:26, Rich wrote:

In comp.os.linux.misc The Natural Philosopher <tnp@invalid.invalid> wrote:

On 15/09/2023 15:27, Ralf Fassel wrote:

Note that the assignment

thermometers[i].name=strdup(p);

is *inside* the while() loop without a free().

Probably you argue that there ever is only a single file to read in
that dir anyway... Personally, I've been bitten by such
assumptions, so I'd rather check once too often than hunting down
"can't happen" bugs.

I do think that what has happened is that a valid file name has been
found with empty data, or no file at all, and then no strdup is done
- but the free is, next time around.

That should never happen of course, as the fopen/fwrite sequence
should certainly not delete the filename, but it is entirely possible
that the fopen *truncates* its data. At which point we can't strdup
anything, so the next free gets a woopsie.

Are the "files" being written to by an independent process separate
from this reading process?

Yes

If yes, are you doing any form of locking/synchronization to prevent
the reading process from trying to read from a file that a writing
process has open/truncated, but not yet written any data into?

No.

If no, then you may be also hitting a race condition where the stars
align just right, the writer has just performed its fopen/truncate
(leaving the file empty) and the kernel decides to context switch away
to the reader at that point, before the writer can write and close the
file. The reader will then see an empty file.

I think that is exactly the case. I didn't think that was in fact possible.

The classic "lock free" solution to this one is for the writer to
create and write to a temporary file, and after closing the temp file
to rename() it to the name of the real file. Rename is documented to
be atomic, so the reader would never see a half open, or partially
complete, file in this case.

Yes, I was just wondering that before I read this post. Rename unlinks
the old file does it?

I might implement that, as well. It doesn't really matter however, as
in practice the structures that contain thermometer data don't get
altered if no valid data is found, so the lack of a proper file, instead of
causing a crash, now simply means the (unused in this program) name data
gets erased. For a few seconds. It simply misses a reading and uses last
time's data for everything else. Mostly the temperature.

--
Truth welcomes investigation because truth knows investigation will lead
to converts. It is deception that uses all the other techniques.


From The Natural Philosopher@3:770/3 to vallor on Fri Sep 15 16:46:42 2023

XPost: comp.os.linux.misc

On 15/09/2023 16:12, vallor wrote:

On Fri, 15 Sep 2023 14:56:23 +0100, The Natural Philosopher <tnp@invalid.invalid> wrote in <ue1nq7$39033$1@dont-email.me>:

On 15/09/2023 14:23, Theo wrote:

In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid>
wrote:

On 15/09/2023 12:12, Ralf Fassel wrote:

| {
| *q++=0;
| thermometers[i].name=strdup(p); // make a copy of the name and
|                                 // attach it to our thermometer structure

Memory leak if thermometers[i].name already contains something.

further up the line...

bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store
values. **/
for(i=0;i<NUMBER_RELAYS;i++)
    free(thermometers[i].name);

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i
you call strdup() exactly once, and subsequently free() exactly once?
If there was some code path that was breaking out of the loop or
similar you might get such behaviour.

Hmm. I free the pointers even for relay zones that don't have
thermometers, whose pointers are 0. That isn't an issue.

But that might be a remotely possible issue. I don't zero the pointers
after freeing them as far as I can tell. The silly thing is that this
program doesn't use the name anyway.

It's used elsewhere. Well, I don't think it's an issue, but I can zero the
pointers anyway after free()ing.

Theo

Hi, read the thread with interest.

If you're getting SIGABRT, that's almost always the software
calling abort(3). If you aren't, maybe there's a library calling it?

$ man 7 signal
[...]
Signal      Standard   Action   Comment
SIGABRT      P1990      Core    Abort signal from abort(3)
[but it also lists]
SIGIOT         -        Core    IOT trap. A synonym for SIGABRT
_ _ _ _ _ _ _

Meanwhile, if you want to avoid locking your file, you might want to write
a fresh file with a unique name, then rename() it,
which -- please correct me if I'm wrong -- should replace
the desired file atomically.

I think the consensus is that it does.

Presumably if the read process has the old file open, that will be valid
until it closes it?

--
"I guess a rattlesnake ain't risponsible fer bein' a rattlesnake, but ah
puts mah heel on um jess the same if'n I catches him around mah chillun".


From Ralf Fassel@3:770/3 to All on Fri Sep 15 18:13:44 2023

XPost: comp.os.linux.misc

* The Natural Philosopher <tnp@invalid.invalid>
| On 15/09/2023 15:27, Ralf Fassel wrote:
| > * The Natural Philosopher <tnp@invalid.invalid>
| > | > | thermometers[i].name=strdup(p); //
| > | > | make a copy of the name and attach it
| > | > | to our thermometer structure
| > | > Memory leak if thermometers[i].name already contains something.
| > | >
| > | further up the line...
| >>
| > | bzero(filbuf,sizeof(filbuf));
| > | /** first thing to do is clean any allocated memory used to
| > | store values. **/
| > | for(i=0;i<NUMBER_RELAYS;i++)
| > | free(thermometers[i].name);
| > Note that the assignment
| > thermometers[i].name=strdup(p);
| > is *inside* the while() loop without a free().
| > Probably you argue that there ever is only a single file to read in
| > that dir anyway... Personally, I've been bitten by such assumptions, so I'd
| > rather check once too often than hunting down "can't happen" bugs.
| > R'
| >
| No. you have misunderstood how the code works.

Sorry, but I have to give that compliment back. You describe how the
code is _intended_ to work. I described how the code _actually_ works.

It all depends on what files with which content are there in that
directory, so if there ever is only one file per ZONE, all is peachy.
If not, all bets are off.

Not 100% seriously, may I refer you to
https://core.tcl-lang.org/tips/doc/trunk/tip/131.md
;-)

| (It would be trivial to simply add a conditional that only strdups to
| a pointer if it is NULL).

With char* malloc'd pointers, I find it much easier to simply stick to
the pattern:
- initialize to 0
- free before reassignment
- assign to 0 after free when not directly reassigning
instead of arguing at each place why not sticking to the pattern is not
a problem.
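
As a rough sketch of that three-step pattern in C (the struct layout and the helper names set_name and clear_name are hypothetical, chosen only to echo the thermometers[i].name field quoted in this thread):

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup() under strict -std=c99 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the thread's thermometer structure.
 * Step 1, "initialize to 0": create instances as { NULL } or via calloc(). */
struct thermometer {
    char *name;     /* malloc'd by strdup(), or NULL when unset */
};

/* Step 2, "free before reassignment": safe however many files map to
 * the same index during one scan. Returns 0 on success, -1 on failure. */
int set_name(struct thermometer *t, const char *new_name)
{
    char *copy;

    assert(t && new_name);
    copy = strdup(new_name);
    if (!copy)
        return -1;          /* allocation failed: keep the old name */
    free(t->name);          /* free(NULL) is a harmless no-op */
    t->name = copy;
    return 0;
}

/* Step 3, "assign to 0 after free when not directly reassigning". */
void clear_name(struct thermometer *t)
{
    assert(t);
    free(t->name);
    t->name = NULL;         /* a repeated clear now frees NULL, not a stale pointer */
}
```

With helpers like these, the clean-up loop could call clear_name(&thermometers[i]) and the parser set_name(&thermometers[i], p), so neither a duplicate file nor an empty one leaves a dangling pointer behind.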

| However they are not at this time misconfigured, so it shouldn't be
| the crash problem, [...]

Agreed.

| I do think that what has happened is that a valid file name has been
| found with empty data, or no file at all, and then no strdup is done -
| but the free is, next time around.

Easy to verify via diagnostics, just add a stderr-output for every
unexpected situation (such as the same index seen twice etc).
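
One possible shape for such a diagnostic, as a sketch only: the value 8 for NUMBER_RELAYS, the seen[] array, and the functions reset_seen and note_zone are all assumptions for illustration. Under systemd, stderr from a service normally ends up in the journal.

```c
#include <assert.h>
#include <stdio.h>

#define NUMBER_RELAYS 8          /* assumed value; the thread never gives it */

static int seen[NUMBER_RELAYS];  /* hypothetical: hits per index in this scan */

/* Call once at the start of every directory scan. */
void reset_seen(void)
{
    int i;

    for (i = 0; i < NUMBER_RELAYS; i++)
        seen[i] = 0;
}

/* Call for every file that parsed to zone index i.
 * Returns 0 when the index is new for this scan, -1 after logging otherwise. */
int note_zone(int i, const char *filename)
{
    assert(filename);
    if (i < 0 || i >= NUMBER_RELAYS) {
        fprintf(stderr, "relayd: %s: zone index %d out of range\n",
                filename, i);
        return -1;
    }
    if (seen[i]++ > 0) {
        fprintf(stderr, "relayd: %s: zone index %d seen %d times in one scan\n",
                filename, i, seen[i]);
        return -1;
    }
    return 0;
}
```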

| As is allocating memory only if the pointers are null.

Why not simply free()/strdup()? If you assign to 0 only, you may get
old contents for the new file inside the loop (can't happen, I know :-)!

R'


From Ralf Fassel@3:770/3 to All on Fri Sep 15 18:19:12 2023

XPost: comp.os.linux.misc

* The Natural Philosopher <tnp@invalid.invalid>
| On 15/09/2023 16:12, vallor wrote:
| > Meanwhile, if you want to avoid locking your file, you might want to
| > write
| > a fresh file with a unique name, then rename() it,
| > which -- please correct me if I'm wrong -- should replace
| > the desired file atomically.

| I think the consensus is that it does.

| Presumably if the read process has the old file open, that will be
| valid until it closes it?

On Linux: yes. Once a process has a file open, it sees the 'old'
contents if the file is removed from disk.

https://stackoverflow.com/questions/2028874/what-happens-to-an-open-file-handle-on-linux-if-the-pointed-file-gets-moved-or-d
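
A small C sketch that shows this behaviour, assuming a Linux or other POSIX filesystem; the function name first_byte_after_replace and both file arguments are made up for the example:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical demo: open `path`, replace it via rename() from `newer`,
 * then read through the handle that was opened beforehand.
 * Returns the first byte read through the old handle, or EOF on error. */
int first_byte_after_replace(const char *path, const char *newer)
{
    FILE *old;
    int c;

    assert(path && newer);
    old = fopen(path, "r");
    if (!old)
        return EOF;

    if (rename(newer, path) != 0) {   /* `path` now names the new file */
        fclose(old);
        return EOF;
    }

    c = fgetc(old);     /* still served from the old, now unlinked, file */
    fclose(old);        /* last reference gone: the old data can be reclaimed */
    return c;
}
```

If the old file starts with 'A' and the new one with 'B', the handle opened before the rename should still yield 'A', while a fresh fopen() of the same name sees 'B'.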

R'


From vallor@3:770/3 to Ralf Fassel on Fri Sep 15 16:28:02 2023

XPost: comp.os.linux.misc

On Fri, 15 Sep 2023 18:19:13 +0200, Ralf Fassel <ralfixx@gmx.de> wrote in <ygav8cbh0ji.fsf@akutech.de>:

* The Natural Philosopher <tnp@invalid.invalid>
| On 15/09/2023 16:12, vallor wrote:
| > Meanwhile, if you want to avoid locking your file, you might want to
| > write a fresh file with a unique name, then rename() it,
| > which -- please correct me if I'm wrong -- should replace
| > the desired file atomically.

| I think the consensus is that it does.

| Presumably if the read process has the old file open, that will be
| valid until it closes it?

On Linux: yes. Once a process has a file open, it sees the 'old'
contents if the file is removed from disk.

https://stackoverflow.com/questions/2028874/what-happens-to-an-open-file-handle-on-linux-if-the-pointed-file-gets-moved-or-d

R'

Speaking of which: back in the days of Linux yore, you
could retrieve the contents of a deleted file if a
process still had it open through: /proc/##/fd/*.

(Nowadays, those are symlinks.)

--
-v


From vallor@3:770/3 to tnp@invalid.invalid on Fri Sep 15 16:21:58 2023

XPost: comp.os.linux.misc

On Fri, 15 Sep 2023 16:46:43 +0100, The Natural Philosopher <tnp@invalid.invalid> wrote in <ue1u93$3a7pg$3@dont-email.me>:

On 15/09/2023 16:12, vallor wrote:

On Fri, 15 Sep 2023 14:56:23 +0100, The Natural Philosopher
<tnp@invalid.invalid> wrote in <ue1nq7$39033$1@dont-email.me>:

On 15/09/2023 14:23, Theo wrote:

In comp.sys.raspberry-pi The Natural Philosopher
<tnp@invalid.invalid> wrote:

On 15/09/2023 12:12, Ralf Fassel wrote:

| {
| *q++=0;
| thermometers[i].name=strdup(p); // make a copy of the name and
|                                 // attach it to our thermometer structure

Memory leak if thermometers[i].name already contains something.

further up the line...

bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store
values. **/
for(i=0;i<NUMBER_RELAYS;i++)
    free(thermometers[i].name);

You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each
i you call strdup() exactly once, and subsequently free() exactly
once? If there was some code path that was breaking out of the loop
or similar you might get such behaviour.

Hmm. I free the pointers even for relay zones that don't have
thermometers, whose pointers are 0. That isn't an issue.

But that might be a remotely possible issue. I don't zero the pointers
after freeing them as far as I can tell. The silly thing is that this
program doesn't use the name anyway.

It's used elsewhere. Well, I don't think it's an issue, but I can zero the
pointers anyway after free()ing.

Theo

Hi, read the thread with interest.

If you're getting SIGABRT, that's almost always the software calling
abort(3). If you aren't, maybe there's a library calling it?

$ man 7 signal
[...]
Signal      Standard   Action   Comment
SIGABRT      P1990      Core    Abort signal from abort(3)
[but it also lists]
SIGIOT         -        Core    IOT trap. A synonym for SIGABRT
_ _ _ _ _ _ _

Meanwhile, if you want to avoid locking your file, you might want to
write a fresh file with a unique name, then rename() it,
which -- please correct me if I'm wrong -- should replace the desired
file atomically.

I think the consensus is that it does.

Presumably if the read process has the old file open, that will be valid until it closes it?

Yes -- and the old file remains allocated on disk until
its file descriptor is closed.

--
-v


From Rich@3:770/3 to The Natural Philosopher on Fri Sep 15 18:27:20 2023

XPost: comp.os.linux.misc

In comp.os.linux.misc The Natural Philosopher <tnp@invalid.invalid> wrote:

On 15/09/2023 16:26, Rich wrote:

Are the "files" being written to by an independent process separate
from this reading process?

Yes

If yes, are you doing any form of locking/synchronization to prevent
the reading process from trying to read from a file that a writing
process has open/truncated, but not yet written any data into?

No.

If no, then you may be also hitting a race condition where the stars
align just right, the writer has just performed its fopen/truncate
(leaving the file empty) and the kernel decides to context switch
away to the reader at that point, before the writer can write and
close the file. The reader will then see an empty file.

I think that is exactly the case. I didn't think that was in fact
possible.

It is. One of the points where Linux evaluates whether it should
task switch is upon exit from a syscall. If your writer process runs
out its timeslice during the in-kernel portion of the work for an
fopen, then the kernel will suspend it and schedule another process to
run. You now have an empty, unwritten file on disk which will not be
written to until the writer is next scheduled by the kernel. If the
next process scheduled is the reader, and it was last suspended just
before it did an fopen() on this same file, it will now fopen() an
empty file.
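
To make that window concrete from the reader's side, here is a hedged C sketch. The helper read_small_file is hypothetical; it follows the fopen/fread shape of the READ snippet from the first post but adds the NULL check and reports an empty file explicitly, so the caller can skip the free()/strdup() step for that pass.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical reader helper. Returns the number of bytes read into buf
 * (always NUL-terminated), 0 if the file is currently empty (for example
 * caught between the writer's truncating fopen and its write), or -1 if
 * the file cannot be opened at all. */
long read_small_file(const char *path, char *buf, size_t bufsize)
{
    FILE *fp;
    size_t len;

    assert(path && buf && bufsize > 0);
    fp = fopen(path, "r");
    if (!fp)
        return -1;

    len = fread(buf, 1, bufsize - 1, fp);   /* leave room for the NUL */
    fclose(fp);
    buf[len] = '\0';
    return (long)len;
}
```

A caller that treats any result <= 0 as "no new reading this time" leaves its existing data, and its existing pointers, untouched.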

The classic "lock free" solution to this one is for the writer to
create and write to a temporary file, and after closing the temp file
to rename() it to the name of the real file. Rename is documented to
be atomic, so the reader would never see a half open, or partially
complete, file in this case.

Yes, I was just wondering that before I read this post. Rename unlinks
the old file does it?

Yes: (man 2 rename):

If newpath already exists, it will be atomically replaced, so that
there is no point at which another process attempting to access
newpath will find it missing. However, there will probably be a
window in which both oldpath and newpath refer to the file being
renamed.

I might implement that, as well. It doesn't really matter however,
as in practice the structures that contain thermometer data don't get
altered if no valid data is found, so the lack of a proper file, instead
of causing a crash, now simply means the (unused in this program)
name data gets erased. For a few seconds. It simply misses a
reading and uses last time's data for everything else. Mostly the temperature.

Yes, your temperature monitoring was unaffected. But if the race was
sometimes triggering the pointer double-free that your loop previously
had, then the lack of atomicity was at least one trigger for the
intermittent crash.

So seems like two routes to fix:

1) remove the conditions that can cause a double-free to occur in the
code (seems like you've already done this from other posts)

2) use rename() to move newly written files into place for the reader,
so the reader never opens an empty file (exclusive of the writer
crashing before it wrote anything to the file).

For something that you'll potentially want to 'just run' for
months/years on end without daily care and feeding, doing both is the
better defense.

