On Thu, 14 Sep 2023 06:23:15 +0100
The Natural Philosopher <tnp@invalid.invalid> wrote:
One possibility is that it is opening and reading a file at the precise
time another process is writing it...in both cases the read and write
operations are atomic and done with C code.
READ
====
fp=fopen(fullname, "r");
Anything opened with fopen is a buffered stream; operations on it
are not atomic, so yes, it is very possible for the read to see a partially written file. To avoid the race you need to use some kind of locking.
READ
====
fp=fopen(fullname, "r");
len=fread(filbuf,1,255,fp); // read entire file
WRITE
=====
fp=fopen(filename, "w");
if (fp)
{
fprintf(fp,"%s%s\n",filedata,timestamp);
fclose(fp);
}
Could this cause a problem?
I tend to suspect some sort of asynchronous timing issue because it is
such a rare occurrence. I have been utterly unable to make it happen
on demand...
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.
I rebooted it, and after awhile - about ten minutes, it happened again -
that is the above trace.
I restarted it manually, and it hasn't crashed since.
The web is flooded with instances of this message, all on different
platforms and applications, and it would appear this is a very generic message possibly to do with memory issues.
I don't expect people to know the answer, but I could use some help in puzzling out where to look.
I had a power cut that did leave my network a bit sketchy and it took
two reboots on this desktop to get back to normal. This may or may not
be relevant.
But my question refers to my Pi Zero W server I am developing.
It came up, ok, but then after a while my relay daemon crashed...
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.
I rebooted it, and after awhile - about ten minutes, it happened again -
that is the above trace.
I restarted it manually, and it hasn't crashed since.
The web is flooded with instances of this message, all on different
platforms and applications, and it would appear this is a very generic message possibly to do with memory issues.
One person 'fixed' it by changing CPUs...
Now *as far as I know* there was nothing special about the data the
daemon would be operating on at this point to cause it to crash. I am
fairly sure I have no memory leaks in it - in normal operation it
strdups() and frees() and opens and closes files... and 'top' shows
memory usage is rock steady.
One possibility is that it is opening and reading a file at the precise
time another process is writing it...in both cases the read and write operations are atomic and done with C code.
READ
====
fp=fopen(fullname, "r");
len=fread(filbuf,1,255,fp); // read entire file
WRITE
=====
fp=fopen(filename, "w");
if (fp)
{
fprintf(fp,"%s%s\n",filedata,timestamp);
fclose(fp);
}
Could this cause a problem?
I tend to suspect some sort of asynchronous timing issue because it is
such a rare occurrence. I have been utterly unable to make it happen on demand...
However, I think that for small operations one would have to posit a time between fopen() and fread() in which the file 'disappears' in some
sense. But I *thought* that a file handle once issued would not point
to empty data, and that in fact fopen("w") would create a new
file and the old would not get unlinked until it was 'fclosed'.
In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.
I rebooted it, and after awhile - about ten minutes, it happened again -
that is the above trace.
I restarted it manually, and it hasn't crashed since.
The web is flooded with instances of this message, all on different
platforms and applications, and it would appear this is a very generic
message possibly to do with memory issues.
You're getting SIGABRT which is typically something bailing due to memory corruption, eg corrupting metadata so that malloc can't work, or a double-free.
I would compile it with debugging enabled: '-g' or '-ggdb' flag to your compiler. Then run it under gdb:
$ gdb ./myprog
(gdb) run
and see if it dies. If it does you can get a backtrace to indicate where
the fault occurred:
(gdb) bt
It may be that starting it under systemd is different in some way that it doesn't show up when running it by hand. You could try setting as your systemd command:
gdb -ex run -ex bt --args /usr/local/bin/myprog arg1 arg2
which will run it and then dump a backtrace when it's finished. You may get 'no stack' if it succeeded and didn't record one.
On Thu, 14 Sep 2023 07:57:45 +0100
The Natural Philosopher <tnp@invalid.invalid> wrote:
However, I think that for small operations one would have to posit a time
between fopen() and fread() in which the file 'disappears' in some
sense. But I *thought* that a file handle once issued would not point
to empty data, and that in fact fopen("w") would create a new
file and the old would not get unlinked until it was 'fclosed'.
Nope - from man fopen
“w” Open for writing. The stream is positioned at the beginning of
the file. Truncate the file to zero length if it exists or
create the file if it does not exist.
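Since "w" truncates the existing file, a reader can open it in the window before the writer has put anything back. One common way to close that window, sketched below, is to write a temporary file and rename() it over the target. The helper name write_file_atomically() and the ".tmp" suffix are assumptions for illustration, not anything from the posted daemon.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: replace `path` with `data` so that a concurrent
 * reader sees either the old contents or the new, never an empty file. */
int write_file_atomically(const char *path, const char *data)
{
    char tmp[512];
    assert(path != NULL && data != NULL);

    /* The temporary name must live on the same filesystem as `path`. */
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                      /* path too long for tmp */

    FILE *fp = fopen(tmp, "w");         /* truncates tmp, not path */
    if (!fp)
        return -1;
    if (fputs(data, fp) == EOF) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) == EOF) {            /* buffered write errors surface here */
        remove(tmp);
        return -1;
    }
    /* On POSIX systems rename() atomically replaces an existing `path`. */
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

A directory scanner that only accepts names ending in ".dat" would skip the temporary "name.dat.tmp" file, so the reader should never see a half-written one.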
Theo <theom+news@chiark.greenend.org.uk> writes:
In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.
I rebooted it, and after awhile - about ten minutes, it happened again -
that is the above trace.
I restarted it manually, and it hasn't crashed since.
The web is flooded with instances of this message, all on different
platforms and applications, and it would appear this is a very generic
message possibly to do with memory issues.
You're getting SIGABRT which is typically something bailing due to memory
corruption, eg corrupting metadata so that malloc can't work, or a
double-free.
I would compile it with debugging enabled: '-g' or '-ggdb' flag to your
compiler. Then run it under gdb:
$ gdb ./myprog
(gdb) run
and see if it dies. If it does you can get a backtrace to indicate where
the fault occurred:
(gdb) bt
It may be that starting it under systemd is different in some way that it
doesn't show up when running it by hand. You could try setting as your
systemd command:
gdb -ex run -ex bt --args /usr/local/bin/myprog arg1 arg2
which will run it and then dump a backtrace when it's finished. You may get
'no stack' if it succeeded and didn't record one.
Also:
* I would also have a look at the kernel log; if it’s a kernel-generated
signal then there’s usually a log message about it.
* Run the application under valgrind; depending what the issue is, that
will provide a backtrace and perhaps more detailed information. If it
is a memory corruption issue then it may identify where the corruption
happens, rather than the later point where malloc failed a consistency
check (or whatever it is).
Using valgrind (and/or compiler sanitizer features) is a good idea even before running into trouble, really.
You're getting SIGABRT which is typically something bailing due to memory corruption, eg corrupting metadata so that malloc can't work, or a double-free.
I would compile it with debugging enabled: '-g' or '-ggdb' flag to your compiler. Then run it under gdb:
$ gdb ./myprog
(gdb) run
and see if it dies. If it does you can get a backtrace to indicate where
the fault occurred:
(gdb) bt
The strange thing is that it failed once after a minute, then I rebooted
and it failed after 20 minutes, and it's been running several days now
with no issues at all.
I am not sure valgrind would actually help unless it failed.
In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
The strange thing is that it failed once after a minute, then I rebooted
and it failed after 20 minutes, and it's been running several days now
with no issues at all.
I am not sure valgrind would actually help unless it failed.
valgrind will tell you if it spots memory corruption, even if the corruption is not yet enough to cause it to crash. It may help in making the problem clearer and deterministic where the corruption makes it unpredictable.
Theo
* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.
| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file
Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.
R'
I don't expect people to know the answer, but I could use some help in puzzling out where to look.
One possibility is that it is opening and reading a file at the precise
time another process is writing it...in both cases the read and write operations are atomic and done with C code.
READ
====
fp=fopen(fullname, "r");
len=fread(filbuf,1,255,fp); // read entire file
WRITE
=====
fp=fopen(filename, "w");
if (fp)
{
fprintf(fp,"%s%s\n",filedata,timestamp);
fclose(fp);
}
On 14/09/2023 16:29, Ralf Fassel wrote:
* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.
| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file
Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.
R'
Ralf, I already put that in this morning, recompiled the code and after
an hour, it crashed again.
The filename is built by scanning a directory so the filename must exist.
The code runs as root, so there are no perms issues
I've put in checks to avoid trying to read empty files
I am leaning towards possibly a cracked solder joint or board.
On Thu, 14 Sep 2023 11:35:30 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:
On 14/09/2023 16:29, Ralf Fassel wrote:
* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.
| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file
Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.
R'
Ralf, I already put that in this morning, recompiled the code and after
an hour, it crashed again.
The filename is built by scanning a directory so the filename must exist.
The code runs as root, so there are no perms issues
I've put in checks to avoid trying to read empty files
I am leaning towards possibly a cracked solder joint or board.
Have you run fsck on the file system since the power loss? Make sure the fstab
entry does not have a zero in the sixth field for the file system(s) in
use.
If using systemd, run dracut -f after any fstab changes. Then reboot.
Regards, Dave Hodgins
On 14/09/2023 18:44, David W. Hodgins wrote:
On Thu, 14 Sep 2023 11:35:30 -0400, The Natural Philosopher
<tnp@invalid.invalid> wrote:
On 14/09/2023 16:29, Ralf Fassel wrote:
* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.
| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file
Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.
R'
Ralf, I already put that in this morning, recompiled the code and after
an hour, it crashed again.
The filename is built by scanning a directory so the filename must exist.
The code runs as root, so there are no perms issues
I've put in checks to avoid trying to read empty files
I am leaning towards possibly a cracked solder joint or board.
Have you run fsck on the file system since the power loss? Make sure the
fstab
entry does not have a zero in the sixth field for the file system(s) in
use.
If using systemd, run dracut -f after any fstab changes. Then reboot.
Regards, Dave Hodgins
I assumed that the thing would have done its own fsck on every boot anyway...isn't that a debian default?
(The sixth fields are 2 and 1 respectively for the file systems)
PARTUUID=b8c9fbb7-01 /boot vfat defaults 0 2
PARTUUID=b8c9fbb7-02 / ext4 defaults,noatime 0 1
On 14/09/2023 06:23, The Natural Philosopher wrote:
I don't expect people to know the answer, but I could use some help in
puzzling out where to look.
One possibility is that it is opening and reading a file at the
precise time another process is writing it...in both cases the read
and write operations are atomic and done with C code.
READ
====
fp=fopen(fullname, "r");
len=fread(filbuf,1,255,fp); // read entire file
Elsewhere in this thread it is suggested to check fp != NULL.
Not knowing what the actual program is doing, might I suggest also
closing fp after it has been read.
WRITE
=====
fp=fopen(filename, "w");
if (fp)
{
fprintf(fp,"%s%s\n",filedata,timestamp);
fclose(fp);
}
journalctl -b --no-h|grep fsck
I assumed that the thing would have done its own fsck on every boot anyway...isn't that a debian default?
(The sixth fields are 2 and 1 respectively for the file systems)
PARTUUID=b8c9fbb7-01 /boot vfat defaults 0 2
PARTUUID=b8c9fbb7-02 / ext4 defaults,noatime 0 1
On 14/09/2023 19:53, David W. Hodgins wrote:
journalctl -b --no-h|grep fsck
Sep 14 14:17:03 systemd[1]: Created slice system-systemd\x2dfsck.slice.
Sep 14 14:17:03 systemd[1]: Listening on fsck to fsckd communication Socket.
Sep 14 14:17:04 systemd-fsck[109]: e2fsck 1.46.2 (28-Feb-2021)
Sep 14 14:17:04 systemd-fsck[109]: rootfs: clean, 51075/932256 files, 460111/3822976 blocks
Sep 14 14:17:14 systemd-fsck[178]: fsck.fat 4.2 (2021-01-31)
Sep 14 14:17:14 systemd-fsck[178]: There are differences between boot sector and its backup.
Sep 14 14:17:14 systemd-fsck[178]: This is mostly harmless. Differences: (offset:original/backup)
Sep 14 14:17:14 systemd-fsck[178]: 65:01/00
Sep 14 14:17:14 systemd-fsck[178]: Not automatically fixing this.
Sep 14 14:17:14 systemd-fsck[178]: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
Sep 14 14:17:14 systemd-fsck[178]: Automatically removing dirty bit.
Sep 14 14:17:14 systemd-fsck[178]: *** Filesystem was changed ***
Sep 14 14:17:14 systemd-fsck[178]: Writing changes.
Sep 14 14:17:14 systemd-fsck[178]: /dev/mmcblk0p1: 330 files, 25815/130554 clusters
Sep 14 14:30:12 systemd[1]: systemd-fsckd.service: Succeeded.
both already done. Not closing it was the cause of a memory leak but I
fixed that a fortnight ago.
I am beginning to wonder if I did more damage than just the power socket
when I trod on it.
On 14/09/2023 16:29, Ralf Fassel wrote:
* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.
| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file
Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.
R'
Ralf, I already put that in this morning, recompiled the code and after
an hour, it crashed again.
The filename is built by scanning a directory so the filename must exist.
The Natural Philosopher <tnp@invalid.invalid> wrote:
I am leaning towards possibly a cracked solder joint or board.
Have you run fsck on the file system since the power loss? Make sure the fstab
entry does not have a zero in the sixth field for the file system(s) in use. If using systemd, run dracut -f after any fstab changes. Then reboot.
On 14/09/2023 09:23, Richard Kettlewell wrote:
Also:
* I would also have a look at the kernel log; if it’s a
kernel-generated signal then there’s usually a log message about it.
Nothing in kern.log after the boot process finishes.
* Run the application under valgrind; depending what the issue is, that
will provide a backtrace and perhaps more detailed information. If it
is a memory corruption issue then it may identify where the corruption
happens, rather than the later point where malloc failed a consistency
check (or whatever it is).
Using valgrind (and/or compiler sanitizer features) is a good idea
even before running into trouble, really.
The strange thing is that it failed once after a minute, then I
rebooted and it failed after 20 minutes, and it's been running several
days now with no issues at all.
I am not sure valgrind would actually help unless it failed.
On Thu, 14 Sep 2023 14:57:36 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:
On 14/09/2023 19:53, David W. Hodgins wrote:
journalctl -b --no-h|grep fsck
Sep 14 14:17:03 systemd[1]: Created slice system-systemd\x2dfsck.slice.
Sep 14 14:17:03 systemd[1]: Listening on fsck to fsckd communication Socket.
Sep 14 14:17:04 systemd-fsck[109]: e2fsck 1.46.2 (28-Feb-2021)
Sep 14 14:17:04 systemd-fsck[109]: rootfs: clean, 51075/932256 files, 460111/3822976 blocks
Sep 14 14:17:14 systemd-fsck[178]: fsck.fat 4.2 (2021-01-31)
Sep 14 14:17:14 systemd-fsck[178]: There are differences between boot sector and its backup.
Sep 14 14:17:14 systemd-fsck[178]: This is mostly harmless. Differences: (offset:original/backup)
Sep 14 14:17:14 systemd-fsck[178]: 65:01/00
Sep 14 14:17:14 systemd-fsck[178]: Not automatically fixing this.
Sep 14 14:17:14 systemd-fsck[178]: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
Sep 14 14:17:14 systemd-fsck[178]: Automatically removing dirty bit.
Sep 14 14:17:14 systemd-fsck[178]: *** Filesystem was changed ***
Sep 14 14:17:14 systemd-fsck[178]: Writing changes.
Sep 14 14:17:14 systemd-fsck[178]: /dev/mmcblk0p1: 330 files, 25815/130554 clusters
Sep 14 14:30:12 systemd[1]: systemd-fsckd.service: Succeeded.
If there are any corrupted files, diagnosing any problems they cause
will be difficult. I strongly recommend re-installing.
Regards, Dave Hodgins
On 9/14/23 13:42, The Natural Philosopher wrote:
I assumed that the thing would have done its own fsck on every boot
anyway...isn't that a debian default?
Pretty sure it's a standard, my arch install has it set.
(The sixth fields are 2 and 1 respectively for the file systems)
PARTUUID=b8c9fbb7-01 /boot vfat defaults 0 2
PARTUUID=b8c9fbb7-02 / ext4 defaults,noatime 0 1
1 is fsck check for the root partition and 2 is for others, right?
On 2023-09-14, The Natural Philosopher <tnp@invalid.invalid> wrote:
On 14/09/2023 16:29, Ralf Fassel wrote:
* The Natural Philosopher <tnp@invalid.invalid>
| One possibility is that it is opening and reading a file at the
| precise time another process is writing it...in both cases the read
| and write
| operations are atomic and done with C code.
| READ
| ====
| fp=fopen(fullname, "r");
| len=fread(filbuf,1,255,fp); // read entire file
Check for fp != NULL is missing here in this example code before
fread(). If this is also in the production version, it might be a problem
if the file is not accessible for any reason.
R'
Ralf, I already put that in this morning, recompiled the code and after
an hour, it crashed again.
The filename is built by scanning a directory so the filename must exist.
Maybe not applicable in this situation, but if something deleted
the file between the time of the scan and the time of the fopen
call, it might/would not exist.
In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
both already done. Not closing it was the cause of a memory leak but I
fixed that a fortnight ago.
I am beginning to wonder if I did more damage than just the power socket
when I trod on it.
SIGABRT is a problem in your code. If there is nothing in the kernel
log then it almost certainly isn't a hardware fault. It is a very special skill to have a hardware fault without spewing lots of stuff there.
Post the code somewhere and someone can take a look. Otherwise you need to use the development tools available to you to debug the problem.
Theo
The Natural Philosopher <tnp@invalid.invalid> writes:
On 14/09/2023 09:23, Richard Kettlewell wrote:
Also:
* I would also have a look at the kernel log; if it’s a
kernel-generated signal then there’s usually a log message about it.
Nothing in kern.log after the boot process finishes.
Most likely a bug in your program then.
* Run the application under valgrind; depending what the issue is, that
will provide a backtrace and perhaps more detailed information. If it
is a memory corruption issue then it may identify where the corruption
happens, rather than the later point where malloc failed a consistency
check (or whatever it is).
Using valgrind (and/or compiler sanitizer features) is a good idea
even before running into trouble, really.
The strange thing is that it failed once after a minute, then I
rebooted and it failed after 20 minutes, and it's been running several
days now with no issues at all.
I am not sure valgrind would actually help unless it failed.
It’s extremely good at identifying memory corruption even in cases where that doesn’t immediately lead to a crash; that’s what it’s for. But if it doesn’t, you leave it running until the crash happens.
Up to you, of course, whether you use the tools available, or debug with
one hand tied behind your back.
* The Natural Philosopher <tnp@invalid.invalid>
| On 14/09/2023 16:29, Ralf Fassel wrote:
| > * The Natural Philosopher <tnp@invalid.invalid>
| > | One possibility is that it is opening and reading a file at the
| > | precise time another process is writing it...in both cases the read
| > | and write
| > | operations are atomic and done with C code.
| >>
| > | READ
| > | ====
| > | fp=fopen(fullname, "r");
| > | len=fread(filbuf,1,255,fp); // read entire file
| > Check for fp != NULL is missing here in this example code before
| > fread(). If this also in the production version, it might be a problem
| > if the file is not accessible for any reason.
| > R'
| Ralf, I already put that in this morning, re compiled the code and
| after an hour, it crashed again.
| The filename is built by scanning a directory so the filename must exist.
That assumption does not hold. Since scanning and opening are separated
by a time gap (albeit a 'small' one), there is a non-zero chance that
the file vanished between scan and open.
Further possibilities:
- how is 'filbuf' used after the fread()? If you use it as C-string, make
sure it is 0-terminated (fread() won't do that for you). Maybe use
fgets(3) instead?
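The two points so far (the file may have vanished by the time it is opened, and fread() does not terminate the buffer) could be handled in one place. This is only a sketch: read_small_file() is a hypothetical helper, and the posted daemon's real buffer declaration is not known.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read up to bufsize-1 bytes of `path` into `buf`
 * and always NUL-terminate. Returns bytes read, or -1 if open fails. */
long read_small_file(const char *path, char *buf, size_t bufsize)
{
    assert(path != NULL && buf != NULL && bufsize > 0);

    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;          /* vanished since the directory scan, or unreadable */

    size_t len = fread(buf, 1, bufsize - 1, fp);
    buf[len] = '\0';        /* fread() does not terminate the buffer */
    fclose(fp);             /* close on every path to avoid leaking the stream */
    return (long)len;
}
```

With this shape the caller can treat the result as a C string, and a return of 0 identifies an empty (possibly just-truncated) file that should be skipped.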
| I am leaning towards possibly a cracked solder joint or board.
Well, since the Raspi is cheap, that should be easily checked by simply
using another one. I bet 1 beer that it is *not* a cracked board, since
with that many more processes should run into trouble, not only this particular one.
R' (.sig not from me .-)
Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?
And that means thousands of faultless iterations in a day.
So this bug ( if it is a bug) is a one in a million or worse.
I suppose I could make the thing loop ten times a second (or even
faster) and see if it happens more often..
it's not as though it's chewing up CPU...
The problem I have is that these crashes only recently started
happening: prior to that the code ran for days. And two things happened,
a massive brownout, and then a full power cut, and I trod on it.
And I made systemd start it...
On 15/09/2023 08:30, Richard Kettlewell wrote:
The Natural Philosopher <tnp@invalid.invalid> writes:
I am not sure valgrind would actually help unless it failed.
It’s extremely good at identifying memory corruption even in cases
where that doesn’t immediately lead to a crash; that’s what it’s for.
But if it doesn’t, you leave it running until the crash happens.
Well that is an option for sure.
Up to you, of course, whether you use the tools available, or debug with
one hand tied behind your back.
Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?
dir = opendir(VOLATILE_DIR);
if(!dir)
return;
while ((dp = readdir (dir)) != NULL)
{
filename=dp->d_name;
// skip known bollocks
if(!strcmp(filename, "." ) || !strcmp(filename, ".." )
|| !strcmp(filename, "relays.dat" ))
continue;
// construct full path
sprintf(fullname,"%s/%s",VOLATILE_DIR,filename);
stat(fullname,&stats);// get tfile times
if(time(NULL)-stats.st_ctime >1800) // skip files older than half an hour
continue;
len=strlen(filename);
if(strncmp(filename+len-4, ".dat",4)) // .dat file but not relays.dat
continue;
fp=fopen(fullname, "r");
if(fp==0) //file has disappeared?
continue;
len=fread(filbuf,1,255,fp);
Assert that 'i' is in the valid index range here, before using it as an index into other arrays.
The Natural Philosopher <tnp@invalid.invalid> wrote:
Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?
Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's possible to bit-flip in RAM or storage without it noticing. I don't know which part of the galaxy you inhabit, but cosmic rays are rare enough down here that random bit flips like this don't happen often - ballpark once a year for a server (which has a much greater surface area to absorb them than a Pi).
On 15/09/2023 08:30, Richard Kettlewell wrote:
The Natural Philosopher <tnp@invalid.invalid> writes:
On 14/09/2023 09:23, Richard Kettlewell wrote:
Also:
* I would also have a look at the kernel log; if it’s a
kernel-generated signal then there’s usually a log message about it.
Nothing in kern.log after the boot process finishes.
Most likely a bug in your program then.
* Run the application under valgrind; depending what the issue is, that
will provide a backtrace and perhaps more detailed information. If it
is a memory corruption issue then it may identify where the corruption
happens, rather than the later point where malloc failed a consistency
check (or whatever it is).
Using valgrind (and/or compiler sanitizer features) is a good idea
even before running into trouble, really.
The strange thing is that it failed once after a minute, then I
rebooted and it failed after 20 minutes, and it's been running several
days now with no issues at all.
I am not sure valgrind would actually help unless it failed.
It’s extremely good at identifying memory corruption even in cases where
that doesn’t immediately lead to a crash; that’s what it’s for. But if
it doesn’t, you leave it running until the crash happens.
Well that is an option for sure.
A well-known class of bugs is concurrency/timing races and memory safety violations, which is odds-on what's happening here, especially given we've already picked up on potentially risky code like failing to check for NULL from fopen().
In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?
Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's possible to bit-flip in RAM or storage without it noticing. I don't know which part of the galaxy you inhabit, but cosmic rays are rare enough down here that random bit flips like this don't happen often - ballpark once a year for a server (which has a much greater surface area to absorb them than a Pi).
The Natural Philosopher <tnp@invalid.invalid> writes:The filenames never change length.
dir = opendir(VOLATILE_DIR);
if(!dir)
return;
while ((dp = readdir (dir)) != NULL)
{
filename=dp->d_name;
// skip known bollocks
if(!strcmp(filename, "." ) || !strcmp(filename, ".." )
|| !strcmp(filename, "relays.dat" ))
continue;
// construct full path
sprintf(fullname,"%s/%s",VOLATILE_DIR,filename);
Possible write overrun here.
stat(fullname,&stats);// get tfile times
if(time(NULL)-stats.st_ctime >1800) // skip files older than half an hour
continue;
len=strlen(filename);
if(strncmp(filename+len-4, ".dat",4)) // .dat file but not relays.dat
continue;
Possible read under-run here. (But if it crashes then you’d expect
SIGSEGV rather than SIGABRT, so that’s probably not the issue.)
fp=fopen(fullname, "r");
if(fp==0) //file has disappeared?
continue;
len=fread(filbuf,1,255,fp);
I don’t think the declaration of filbuf has been posted, so there’s a possible write overrun if it’s less than 255 bytes.
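The overrun and under-run noted above can both be ruled out with bounded calls. A minimal sketch, assuming hypothetical helpers build_path() and has_dat_suffix(); neither name comes from the posted code, and the size of fullname in the daemon is unknown.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "dir/name" without overrunning `out`.
 * Returns 0 on success, -1 if the result would not fit. */
int build_path(char *out, size_t outsize, const char *dir, const char *name)
{
    assert(out != NULL && dir != NULL && name != NULL);
    int n = snprintf(out, outsize, "%s/%s", dir, name);
    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}

/* Hypothetical helper: nonzero if `name` ends in ".dat".
 * The length check means we never read before the start of `name`. */
int has_dat_suffix(const char *name)
{
    assert(name != NULL);
    size_t len = strlen(name);
    return len >= 4 && strcmp(name + len - 4, ".dat") == 0;
}
```

In the scanning loop, a -1 from build_path() or a 0 from has_dat_suffix() would simply `continue` to the next directory entry. Checking the return value of stat() before using st_ctime would fit the same pattern.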
Theo <theom+news@chiark.greenend.org.uk> writes:
The Natural Philosopher <tnp@invalid.invalid> wrote:
Tell me in what way a corrupted - say - libc file, or a faulty bit of
memory would show up in the kernel logs?
Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's
possible to bit-flip in RAM or storage without it noticing. I don't know
which part of the galaxy you inhabit, but cosmic rays are rare enough down
here that random bit flips like this don't happen often - ballpark once a
year for a server (which has a much greater surface area to absorb them than
a Pi).
I’ve seen one inarguable random bit flip in several decades. In that
case the behavior was deterministic - chiark’s /bin/ls had got a
single-bit error, and caching meant it crashed _every_ time anyone ran
it.
Maybe TNP has taken a trip to Sizewell?
On 14/09/2023 20:40, candycanearter07 wrote:
On 9/14/23 13:42, The Natural Philosopher wrote:
I assumed that the thing would have done its own fsck on every boot
anyway...isn't that a debian default?
Pretty sure it's a standard, my arch install has it set.
(The sixth fields are 2 and 1 respectively for the file systems)
PARTUUID=b8c9fbb7-01 /boot vfat defaults 0 2
PARTUUID=b8c9fbb7-02 / ext4 defaults,noatime 0 1
1 is fsck check for the root partition and 2 is for others, right
I looked it up, it merely specifies the order I think, so you are right
in practice.
You trust the contents of 'outside'-files very much, do you? ;-)
I don't know who can create files in the directory you're scanning, but
not *assuring* the input you expect is another possible cause for
problems...
* The Natural Philosopher <tnp@invalid.invalid>
| > Further possibilities:
| > - how is 'filbuf' used after the fread()? If you use it as C-string, make
| > sure it is 0-terminated (fread() won't do that for you). Maybe use
| > fgets(3) instead?
| >
| dir = opendir(VOLATILE_DIR);
| if(!dir)
| return;
| while ((dp = readdir (dir)) != NULL)
[looks good, error checks for stat() et al couldn't hurt]
--<snip-snip>--
| if(len=strncmp(filbuf,"ZONE",4)) //supposed to reject
| a file whose contents do not start with ZONE
| goto baddata;
|
| // looking very much like a temperature file
| i=(int)filbuf[4] -'1'; // this is our zone from
| "ZONE2" etc. 1-4 is zone but index is 0-3 so subtract
| '1'
The access of filbuf[4] is ok (since you checked that there are at least
4 characters in the file), but what if nothing follows after the 'ZONE',
or ZONE is followed by anything but [1-4]?
Assert that 'i' is in the valid index range here, before using it as an
index into other arrays.
| p=strstr(filbuf,"\n");
| if(p)
| {
| p++;
| if(q=strstr(p,"\n"))
| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure
Memory leak if thermometers[i].name already contains something.
Other than that, I really would have it running under a debugger or
valgrind, since then *if* it crashes, you *know* *where* in your code it crashes.
Good luck hunting!
R'
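A minimal sketch of the input checks suggested above, not the poster's actual code. The helper names read_small_file() and parse_zone_index() and the constant NUMBER_ZONES are made up for illustration; the only thing taken from the thread is the "ZONE1".."ZONE4" file format.

```c
#include <stdio.h>
#include <string.h>

#define NUMBER_ZONES 4   /* assumption: zones are ZONE1..ZONE4 */

/* Read at most bufsize-1 bytes and always 0-terminate the result.
 * Returns the number of bytes read, or -1 if the file cannot be opened. */
static long read_small_file(const char *path, char *buf, size_t bufsize)
{
    FILE *fp;
    size_t len;

    if (bufsize == 0)
        return -1;
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    len = fread(buf, 1, bufsize - 1, fp);
    buf[len] = '\0';          /* fread() does not terminate the buffer */
    fclose(fp);
    return (long)len;
}

/* Return the 0-based zone index for "ZONE1".."ZONE4", or -1 otherwise. */
static int parse_zone_index(const char *buf)
{
    int i;

    if (strncmp(buf, "ZONE", 4) != 0)
        return -1;
    i = buf[4] - '1';         /* buf[4] is at worst the terminating 0 */
    if (i < 0 || i >= NUMBER_ZONES)
        return -1;
    return i;
}
```

With this shape an empty or truncated file yields -1 from parse_zone_index() and the caller can skip it instead of indexing an array with garbage.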
In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
On 15/09/2023 12:12, Ralf Fassel wrote:
| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure
Memory leak if thermometers[i].name already contains something.
further up the line...
bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);
You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there was some code path that was breaking out of the loop or similar you might
get such behaviour.
Theo
On 15/09/2023 12:12, Ralf Fassel wrote:
| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure
Memory leak if thermometers[i].name already contains something.
further up the line...
bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);
You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there was some code path that was breaking out of the loop or similar you might
get such behaviour.
* The Natural Philosopher <tnp@invalid.invalid>
| > | thermometers[i].name=strdup(p); //
| > | make a copy of the name and attach it
| > | to our thermometer structure
| > Memory leak if thermometers[i].name already contains something.
| >
| further up the line...
| bzero(filbuf,sizeof(filbuf));
| /** first thing to do is clean any allocated memory used to
| store values. **/
| for(i=0;i<NUMBER_RELAYS;i++)
| free(thermometers[i].name);
Note that the assignment
thermometers[i].name=strdup(p);
is *inside* the while() loop without a free().
Probably you argue that there ever is only a single file to read in that
dir anyway... Personally, I've been bitten by such assumptions, so I'd rather check once too often than hunting down "can't happen" bugs.
R'
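One way to make the assignment inside the loop safe regardless of how many files turn up is to release the old string at the point of replacement. This is only a sketch; set_name() is a hypothetical helper, not something from the posted program.

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup() under strict -std= modes */
#include <stdlib.h>
#include <string.h>

/* Replace *slot with a copy of value, releasing whatever was there before.
 * Returns 0 on success, -1 if strdup() fails (the old value is kept). */
static int set_name(char **slot, const char *value)
{
    char *copy = strdup(value);

    if (!copy)
        return -1;
    free(*slot);      /* free(NULL) is a no-op, so an empty slot is fine */
    *slot = copy;
    return 0;
}
```

The call in the loop would then read set_name(&thermometers[i].name, p), assuming the name field starts out as NULL.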
I had assumed that freeing a pointer that already had been freed would
either result in a NO-OP because the pointer no longer existed in the
heap memory allocation tables, or it would instantly crash, but it
seems that the action is 'undefined'.
On 9/15/23 08:23, Theo wrote:
In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid>
wrote:
On 15/09/2023 12:12, Ralf Fassel wrote:
| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure
Memory leak if thermometers[i].name already contains something.
further up the line...
bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);
You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i
you call strdup() exactly once, and subsequently free() exactly once?
If there was some code path that was breaking out of the loop or
similar you might get such behaviour.
Theo
I thought double free was a SIGSEGV?
On 15/09/2023 14:23, Theo wrote:
In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid>
wrote:
On 15/09/2023 12:12, Ralf Fassel wrote:
| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure
Memory leak if thermometers[i].name already contains something.
further up the line...
bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);
You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i
you call strdup() exactly once, and subsequently free() exactly once?
If there was some code path that was breaking out of the loop or
similar you might get such behaviour.
Theo
Hmm. I free the pointers even for relay zones that don't have
thermometers, whose pointers are 0. That isn't an issue.
But that might be a remotely possible issue. I don't zero the pointers
after freeing them as far as I can tell. The silly thing is that this
program doesn't use the name anyway. It's used elsewhere.
Well, I don't think it's an issue, but I can zero the pointers anyway
after free()ing.
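For what it's worth, zeroing after free() is cheap and makes the cleanup pass safe to run any number of times. A sketch only: the struct layout and the value of NUMBER_RELAYS are assumptions, since neither was posted, and clear_names() is a made-up name.

```c
#include <stdlib.h>

#define NUMBER_RELAYS 4            /* assumption: the real value was not posted */

struct thermometer {
    char *name;                    /* heap copy from strdup(), or NULL */
};

/* Free every name and reset the pointer so a later pass cannot free it twice. */
static void clear_names(struct thermometer *t, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        free(t[i].name);           /* no-op when the pointer is already NULL */
        t[i].name = NULL;
    }
}
```

If a pass then finds no valid file and skips the strdup(), the next cleanup sees NULL rather than a stale pointer.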
Personally, I want to run with full debug, stack trace, logging,
exception handling, and bounds checking turned on all the time, even in production. Which is why I generally use a modern language like C# or
Java.
I thought double free was a SIGSEGV?
candycanearter07 <no@thanks.net> writes:
On 9/15/23 08:23, Theo wrote:
You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each
i you call strdup() exactly once, and subsequently free() exactly
once? If there was some code path that was breaking out of the loop
or similar you might get such behaviour.
I thought double free was a SIGSEGV?
If Glibc detects it you’ll get a diagnostic and SIGABRT.
If it doesn’t detect it then anything could happen - SIGSEGV is just one possibility.
On 15/09/2023 15:27, Ralf Fassel wrote:
Note that the assignment
thermometers[i].name=strdup(p);
is *inside* the while() loop without a free().
Probably you argue that there ever is only a single file to read in
that dir anyway... Personally, I've been bitten by such
assumptions, so I'd rather check once too often than hunting down
"can't happen" bugs.
I do think that what has happened is that a valid file name has been
found with empty data, or no file at all, and then no strdup is done
- but the free is, next time around.
That should never happen of course, as the fopen/fwrite sequence
should certainly not delete the filename, but it is entirely possible
that the fopen *truncates* its data. At which point we can't strdup
anything, so the next free gets a whoopsie.
In comp.os.linux.misc The Natural Philosopher <tnp@invalid.invalid> wrote:
On 15/09/2023 15:27, Ralf Fassel wrote:
Note that the assignment
thermometers[i].name=strdup(p);
is *inside* the while() loop without a free().
Probably you argue that there ever is only a single file to read in
that dir anyway... Personally, I've been bitten by such
assumptions, so I'd rather check once too often than hunting down
"can't happen" bugs.
I do think that what has happened is that a valid file name has been
found with empty data, or no file at all, and then no strdup is done
- but the free is, next time around.
That should never happen of course, as the fopen/fwrite sequence
should certainly not delete the filename, but it is entirely possible
that the fopen *truncates* its data. At which point we can't strdup
anything, so the next free gets a whoopsie.
Are the "files" being written to by an independent process separate
from this reading process?
If yes, are you doing any form of locking/synchronization to prevent
the reading process from trying to read from a file that a writing
process has open/truncated, but not yet written any data into?
If no, then you may be also hitting a race condition where the stars
align just right, the writer has just performed its fopen/truncate
(leaving the file empty) and the kernel decides to context switch away
to the reader at that point, before the writer can write and close the
file. The reader will then see an empty file.
The classic "lock free" solution to this one is for the writer to
create and write to a temporary file, and after closing the temp file
to rename() it to the name of the real file. Rename is documented to
be atomic, so the reader would never see a half open, or partially
complete, file in this case.
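A rough sketch of that write-then-rename pattern in C. The function name write_file_atomically() and the ".tmp" suffix are invented here, and the sketch assumes a single writer per file; with several writers a unique temporary name (for example from mkstemp()) would be needed. The temporary file is created next to the target because rename() only replaces atomically within one filesystem.

```c
#include <stdio.h>

/* Write text to path by way of a temporary file in the same directory,
 * then rename() it into place. Returns 0 on success, -1 on failure. */
static int write_file_atomically(const char *path, const char *text)
{
    char tmp[4096];
    FILE *fp;
    int n;

    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || n >= (int)sizeof tmp)
        return -1;                      /* path too long for the buffer */
    fp = fopen(tmp, "w");
    if (!fp)
        return -1;
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {              /* flush errors show up here */
        remove(tmp);
        return -1;
    }
    if (rename(tmp, path) != 0) {       /* readers see old or new, never empty */
        remove(tmp);
        return -1;
    }
    return 0;
}
```

The reader side needs no change: it opens the real name and gets either the previous complete file or the new complete file.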
On Fri, 15 Sep 2023 14:56:23 +0100, The Natural Philosopher <tnp@invalid.invalid> wrote in <ue1nq7$39033$1@dont-email.me>:
On 15/09/2023 14:23, Theo wrote:
In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid>
wrote:
On 15/09/2023 12:12, Ralf Fassel wrote:
| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure
Memory leak if thermometers[i].name already contains something.
further up the line...
bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);
You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each i
you call strdup() exactly once, and subsequently free() exactly once?
If there was some code path that was breaking out of the loop or
similar you might get such behaviour.
Theo
Hmm. I free the pointers even for relay zones that don't have
thermometers, whose pointers are 0. That isn't an issue.
But that might be a remotely possible issue. I don't zero the pointers
after freeing them as far as I can tell. The silly thing is that this
program doesn't use the name anyway. It's used elsewhere.
Well, I don't think it's an issue, but I can zero the pointers anyway
after free()ing.
Hi, read the thread with interest.
If you're getting SIGABRT, that's almost always the software
calling abort(3). If you aren't, maybe there's a library calling it?
$ man 7 signal
[...]
Signal Standard Action Comment
SIGABRT P1990 Core Abort signal from abort(3)
[but it also lists]
SIGIOT - Core IOT trap. A synonym for SIGABRT
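To see the connection between abort(3) and the signal that systemd reports, a small sketch: fork a child that calls abort() and look at how it died. The function name signal_from_abort() is made up for this illustration.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that calls abort() and report the signal that killed it.
 * Returns the signal number, or -1 if the child was not killed by a signal. */
static int signal_from_abort(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        abort();                 /* raises SIGABRT in the child */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

On Linux SIGABRT is signal 6, which matches the "status=6/ABRT" in the log at the start of the thread.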
_ _ _ _ _ _ _
Meanwhile, if you want to avoid locking your file, you might want to write
a fresh file with a unique name, then rename() it,
which -- please correct me if I'm wrong -- should replace
the desired file atomically.
* The Natural Philosopher <tnp@invalid.invalid>
| On 15/09/2023 16:12, vallor wrote:
| > Meanwhile, if you want to avoid locking your file, you might want to
| > write a fresh file with a unique name, then rename() it,
| > which -- please correct me if I'm wrong -- should replace
| > the desired file atomically.
| I think the consensus is that it does.
| Presumably if the read process has the old file open, that will be
| valid until it closes it?
On Linux: yes. Once a process has a file open, it sees the 'old'
contents if the file is removed from disk.
https://stackoverflow.com/questions/2028874/what-happens-to-an-open-file-handle-on-linux-if-the-pointed-file-gets-moved-or-d
R'
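This is easy to check with a few lines of C. The sketch below opens a file, lets rename() replace it with another one, and then reads through the handle that was opened first. read_after_replace() is a hypothetical name, and both paths are assumed to be on the same filesystem.

```c
#include <stdio.h>
#include <string.h>

/* Open path, replace it via rename() from newpath, then read through the
 * handle that was opened first. Copies what that handle sees into out.
 * Returns 0 on success, -1 on failure. */
static int read_after_replace(const char *path, const char *newpath,
                              char *out, size_t outsize)
{
    FILE *old;
    size_t len;

    if (outsize == 0)
        return -1;
    old = fopen(path, "r");
    if (!old)
        return -1;
    if (rename(newpath, path) != 0) {
        fclose(old);
        return -1;
    }
    len = fread(out, 1, outsize - 1, old);   /* still the old file's data */
    out[len] = '\0';
    fclose(old);
    return 0;
}
```

A fresh fopen() of the same name after the rename() returns the new contents, so a reader only ever deals with one complete version per open.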
On 15/09/2023 16:12, vallor wrote:
On Fri, 15 Sep 2023 14:56:23 +0100, The Natural Philosopher
<tnp@invalid.invalid> wrote in <ue1nq7$39033$1@dont-email.me>:
On 15/09/2023 14:23, Theo wrote:
In comp.sys.raspberry-pi The Natural Philosopher
<tnp@invalid.invalid> wrote:
On 15/09/2023 12:12, Ralf Fassel wrote:
| {
| *q++=0;
| thermometers[i].name=strdup(p); //
| make a copy of the name and attach it
| to our thermometer structure
Memory leak if thermometers[i].name already contains something.
further up the line...
bzero(filbuf,sizeof(filbuf));
/** first thing to do is clean any allocated memory used to store values. **/
for(i=0;i<NUMBER_RELAYS;i++)
free(thermometers[i].name);
You could get a SIGABRT if you were trying to free something that was
already freed. Are you sure those are interlocked such that for each
i you call strdup() exactly once, and subsequently free() exactly
once? If there was some code path that was breaking out of the loop
or similar you might get such behaviour.
Theo
Hmm. I free the pointers even for relay zones that don't have
thermometers, whose pointers are 0. That isn't an issue.
But that might be a remotely possible issue. I don't zero the pointers
after freeing them as far as I can tell. The silly thing is that this
program doesn't use the name anyway. It's used elsewhere.
Well, I don't think it's an issue, but I can zero the pointers anyway
after free()ing.
Hi, read the thread with interest.
If you're getting SIGABRT, that's almost always the software calling
abort(3). If you aren't, maybe there's a library calling it?
$ man 7 signal
[...]
Signal Standard Action Comment
SIGABRT P1990 Core Abort signal from abort(3)
[but it also lists]
SIGIOT - Core IOT trap. A synonym for SIGABRT
_ _ _ _ _ _ _
Meanwhile, if you want to avoid locking your file, you might want to
write a fresh file with a unique name, then rename() it,
which -- please correct me if I'm wrong -- should replace the desired
file atomically.
I think the consensus is that it does.
Presumably if the read process has the old file open, that will be
valid until it closes it?
On 15/09/2023 16:26, Rich wrote:
Are the "files" being written to by an independent process separate
from this reading process?
Yes
If yes, are you doing any form of locking/synchronization to prevent
the reading process from trying to read from a file that a writing
process has open/truncated, but not yet written any data into?
No.
If no, then you may be also hitting a race condition where the stars
align just right, the writer has just performed its fopen/truncate
(leaving the file empty) and the kernel decides to context switch
away to the reader at that point, before the writer can write and
close the file. The reader will then see an empty file.
I think that is exactly the case. I didn't think that was in fact
possible.
The classic "lock free" solution to this one is for the writer to
create and write to a temporary file, and after closing the temp file
to rename() it to the name of the real file. Rename is documented to
be atomic, so the reader would never see a half open, or partially
complete, file in this case.
Yes, I was just wondering that before I read this post. Rename unlinks
the old file, does it?
I might implement that, as well. It doesn't really matter however,
as in practice the structures that contain thermometer data don't get
altered if no valid data is found, so the lack of a proper file,
instead of causing a crash, now simply means the (unused in this
program) name data gets erased. For a few seconds. It simply misses a
reading and uses last time's data for everything else. Mostly the
temperature.