• Weird code crash

    From The Natural Philosopher@3:770/3 to All on Thu Sep 14 06:23:14 2023
    XPost: comp.os.linux.misc

    I don't expect people to know the answer, but I could use some help in
    puzzling out where to look.

    I had a power cut that did leave my network a bit sketchy and it took
    two reboots on this desktop to get back to normal. This may or may not
    be relevant.

    But my question refers to my Pi Zero W server I am developing.

    It came up, ok, but then after a while my relay daemon crashed...

    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.

    I rebooted it, and after a while - about ten minutes - it happened again.
    That is the above trace.

    I restarted it manually, and it hasn't crashed since.

    The web is flooded with instances of this message, all on different
    platforms and applications, and it would appear this is a very generic
    message, possibly to do with memory issues.

    One person 'fixed' it by changing CPUs...
    Now *as far as I know* there was nothing special about the data the
    daemon would be operating on at this point to cause it to crash. I am
    fairly sure I have no memory leaks in it - in normal operation it
    calls strdup() and free() and opens and closes files... and 'top' shows
    memory usage is rock steady.

    One possibility is that it is opening and reading a file at the precise
    time another process is writing it...in both cases the read and write operations are atomic and done with C code.

    READ
    ====
    fp=fopen(fullname, "r");
    len=fread(filbuf,1,255,fp); // read entire file

    WRITE
    =====
    fp=fopen(filename, "w");
    if (fp)
    {
    fprintf(fp,"%s%s\n",filedata,timestamp);
    fclose(fp);
    }

    Could this cause a problem?

    I tend to suspect some sort of asynchronous timing issue because it is
    such a rare occurrence. I have been utterly unable to make it happen on demand...


    --
    A lie can travel halfway around the world while the truth is putting on
    its shoes.

    --- SoupGate-Win32 v1.05
    * Origin: Agency HUB, Dunedin - New Zealand | Fido<>Usenet Gateway (3:770/3)
  • From Ahem A Rivet's Shot@3:770/3 to The Natural Philosopher on Thu Sep 14 07:09:14 2023
    XPost: comp.os.linux.misc

    On Thu, 14 Sep 2023 06:23:15 +0100
    The Natural Philosopher <tnp@invalid.invalid> wrote:

    One possibility is that it is opening and reading a file at the precise
    time another process is writing it...in both cases the read and write operations are atomic and done with C code.

    READ
    ====
    fp=fopen(fullname, "r");

    Anything opened with fopen is a buffered stream; operations on it
    are not atomic, so yes, it is very possible for the read to see a
    partially written file. To avoid the race you need to use some kind of locking.
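
    As a rough sketch of what "some kind of locking" could look like on Linux, here is one possibility using flock(2) advisory locks. The function names write_locked and read_locked are made up for illustration and are not from the original daemon. It assumes both processes cooperate and use the same scheme. Note that the writer must not truncate the file until it holds the lock, which is why it uses open() without O_TRUNC rather than fopen(..., "w").

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical writer: take an exclusive lock BEFORE truncating, so a
 * reader holding a shared lock never sees an empty or half-written file. */
int write_locked(const char *path, const char *data)
{
    assert(path && data);
    int fd = open(path, O_WRONLY | O_CREAT, 0644);   /* deliberately no O_TRUNC */
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) < 0) {
        close(fd);
        return -1;
    }
    size_t len = strlen(data);
    int ok = ftruncate(fd, 0) == 0 && write(fd, data, len) == (ssize_t)len;
    close(fd);                       /* closing the descriptor drops the lock */
    return ok ? 0 : -1;
}

/* Hypothetical reader: shared lock, reads at most cap-1 bytes and
 * NUL-terminates. Returns the byte count, or -1 on any error. */
ssize_t read_locked(const char *path, char *buf, size_t cap)
{
    assert(path && buf && cap > 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_SH) < 0) {
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, buf, cap - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

    flock() locks are advisory only: a process that ignores them can still read or truncate the file at any time.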

    --
    Steve O'Hara-Smith
    Odds and Ends at http://www.sohara.org/
    Host: Beautiful Theory meet Inconvenient Fact
    Obit: Beautiful Theory died today of factual inconsistency

  • From The Natural Philosopher@3:770/3 to Ahem A Rivet's Shot on Thu Sep 14 07:57:44 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 07:09, Ahem A Rivet's Shot wrote:
    On Thu, 14 Sep 2023 06:23:15 +0100
    The Natural Philosopher <tnp@invalid.invalid> wrote:

    One possibility is that it is opening and reading a file at the precise
    time another process is writing it...in both cases the read and write
    operations are atomic and done with C code.

    READ
    ====
    fp=fopen(fullname, "r");

    Anything opened with fopen is a buffered stream; operations on it
    are not atomic, so yes, it is very possible for the read to see a
    partially written file. To avoid the race you need to use some kind of locking.

    Hmm.

    However, I think that for small operations one would have to posit a time
    between fopen() and fread() in which the file 'disappears' in some
    sense. But I *thought* that a file handle once issued would not point
    to empty data, and that fopen("w") would in fact create a new
    file and the old would not get unlinked until it was fclose()d.
    --
    "Corbyn talks about equality, justice, opportunity, health care, peace, community, compassion, investment, security, housing...."
    "What kind of person is not interested in those things?"

    "Jeremy Corbyn?"

  • From Richard Kettlewell@3:770/3 to The Natural Philosopher on Thu Sep 14 08:45:58 2023
    XPost: comp.os.linux.misc

    The Natural Philosopher <tnp@invalid.invalid> writes:
    READ
    ====
    fp=fopen(fullname, "r");
    len=fread(filbuf,1,255,fp); // read entire file

    There’s no error checking on the call to fopen, so fp could be a null
    pointer when you call fread. So crashes are to be expected, although in
    this code fragment a SIGSEGV would be expected rather than SIGABRT.
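
    A minimal sketch of the read with the missing checks added might look like the following. The function name read_small_file and its return convention are assumptions for illustration, not the poster's actual code. It also NUL-terminates the buffer, which matters only if the data is later treated as a C string (for example passed to strdup()); whether the daemon does that is not stated in the thread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read up to cap-1 bytes of a small file into buf.
 * Returns the number of bytes read, or -1 if the file could not be
 * opened or a read error occurred. buf is NUL-terminated on success. */
long read_small_file(const char *path, char *buf, size_t cap)
{
    assert(path && buf && cap > 0);

    FILE *fp = fopen(path, "r");
    if (fp == NULL) {            /* the check missing from the fragment */
        perror(path);
        return -1;
    }

    size_t len = fread(buf, 1, cap - 1, fp);
    int failed = ferror(fp);     /* distinguish a short read from an error */
    fclose(fp);                  /* always release the stream */

    if (failed)
        return -1;
    buf[len] = '\0';
    return (long)len;
}
```

    A return value of 0 then means the file was opened but empty, which the caller can treat as "try again later" rather than as valid data.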

    WRITE
    =====
    fp=fopen(filename, "w");
    if (fp)
    {
    fprintf(fp,"%s%s\n",filedata,timestamp);
    fclose(fp);
    }

    Could this cause a problem?

    I tend to suspect some sort of asynchronous timing issue because it is
    such a rare occurrence. I have been utterly unable to make it happen
    on demand...

    Investigate properly first (see Theo’s post), guess about the cause
    later.

    --
    https://www.greenend.org.uk/rjk/

  • From Theo@3:770/3 to The Natural Philosopher on Thu Sep 14 08:36:06 2023
    XPost: comp.os.linux.misc

    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.

    I rebooted it, and after a while - about ten minutes - it happened again.
    That is the above trace.

    I restarted it manually, and it hasn't crashed since.

    The web is flooded with instances of this message, all on different
    platforms and applications, and it would appear this is a very generic
    message, possibly to do with memory issues.

    You're getting SIGABRT, which is typically something bailing out due to
    memory corruption, e.g. corrupting metadata so that malloc can't work, or a
    double-free.
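
    To illustrate what the log line is reporting: status=6/ABRT means the process was killed by signal 6, SIGABRT, which a program normally raises against itself by calling abort() (glibc does so when a malloc consistency check fails, and a failed assert() does the same). The sketch below reproduces that in a child process and reads the signal back the way a supervisor would. The names signal_that_killed, call_abort and do_nothing are invented for this example.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run fn() in a child process and report how it died.
 * Returns the terminating signal number, 0 if the child exited normally,
 * or -1 if fork()/waitpid() failed. */
int signal_that_killed(void (*fn)(void))
{
    assert(fn != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {              /* child: run the function, then exit cleanly */
        fn();
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Stand-in for "the C library detected corruption and gave up". */
void call_abort(void) { abort(); }

/* Stand-in for a healthy run. */
void do_nothing(void) { }
```

    On Linux, signal_that_killed(call_abort) should report SIGABRT (6), matching the systemd message, while do_nothing reports 0.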

    I would compile it with debugging enabled: '-g' or '-ggdb' flag to your compiler. Then run it under gdb:

    $ gdb ./myprog
    (gdb) run

    and see if it dies. If it does you can get a backtrace to indicate where
    the fault occurred:

    (gdb) bt

    It may be that starting it under systemd differs in some way, so that the
    problem doesn't show up when running it by hand. You could try setting as your
    systemd command:

    gdb -ex run -ex bt --args /usr/local/bin/myprog arg1 arg2

    which will run it and then dump a backtrace when it's finished. You may get 'no stack' if it succeeded and didn't record one.

    Theo

  • From Tauno Voipio@3:770/3 to The Natural Philosopher on Thu Sep 14 10:55:34 2023
    XPost: comp.os.linux.misc

    The first try should be to check if the system runs fine from a
    backup memory card (you have it?).

    It is fairly possible that the memory card has some flipped bits,
    and the effects are hard to predict.

    --

    -TV


  • From Ahem A Rivet's Shot@3:770/3 to The Natural Philosopher on Thu Sep 14 08:52:36 2023
    XPost: comp.os.linux.misc

    On Thu, 14 Sep 2023 07:57:45 +0100
    The Natural Philosopher <tnp@invalid.invalid> wrote:

    However, I think that for small operations one would have to posit a time
    between fopen() and fread() in which the file 'disappears' in some
    sense. But I *thought* that a file handle once issued would not point
    to empty data, and that fopen("w") would in fact create a new
    file and the old would not get unlinked until it was fclose()d.

    Nope - from man fopen

    “w” Open for writing. The stream is positioned at the beginning of
    the file. Truncate the file to zero length if it exists or
    create the file if it does not exist.
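
    The behaviour the poster expected (readers see either the old file or the new one, never an empty one) can be had by writing to a temporary file and then rename()ing it over the target, since rename() within one filesystem replaces the name atomically. A minimal sketch, assuming a hypothetical helper write_atomic and a ".tmp" suffix chosen purely for illustration:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: replace the contents of path with data so that a
 * concurrent reader sees either the complete old file or the complete
 * new one. Returns 0 on success, -1 on failure. */
int write_atomic(const char *path, const char *data)
{
    assert(path && data);

    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                       /* path too long for the buffer */

    FILE *fp = fopen(tmp, "w");          /* truncation hits only the temp file */
    if (fp == NULL)
        return -1;

    int ok = fputs(data, fp) != EOF;
    if (fclose(fp) != 0)                 /* fclose flushes; check it too */
        ok = 0;

    if (!ok || rename(tmp, path) != 0) { /* atomic swap on the same filesystem */
        remove(tmp);
        return -1;
    }
    return 0;
}
```

    Since the reader in this thread builds its filenames by scanning a directory, it would need to skip the temporary names. This sketch also does not call fsync(), so it protects against concurrent readers but not against power loss.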

    --
    Steve O'Hara-Smith
    Odds and Ends at http://www.sohara.org/
    Host: Beautiful Theory meet Inconvenient Fact
    Obit: Beautiful Theory died today of factual inconsistency

  • From Richard Kettlewell@3:770/3 to Theo on Thu Sep 14 09:23:00 2023
    XPost: comp.os.linux.misc

    Theo <theom+news@chiark.greenend.org.uk> writes:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.

    I rebooted it, and after a while - about ten minutes - it happened again.
    That is the above trace.

    I restarted it manually, and it hasn't crashed since.

    The web is flooded with instances of this message, all on different
    platforms and applications, and it would appear this is a very generic
    message, possibly to do with memory issues.

    You're getting SIGABRT which is typically something bailing due to memory corruption, eg corrupting metadata so that malloc can't work, or a double-free.

    I would compile it with debugging enabled: '-g' or '-ggdb' flag to your compiler. Then run it under gdb:

    $ gdb ./myprog
    (gdb) run

    and see if it dies. If it does you can get a backtrace to indicate where
    the fault occurred:

    (gdb) bt

    It may be that starting it under systemd is different in some way that it doesn't show up when running it by hand. You could try setting as your systemd command:

    gdb -ex run -ex bt --args /usr/local/bin/myprog arg1 arg2

    which will run it and then dump a backtrace when it's finished. You may get 'no stack' if it succeeded and didn't record one.

    Also:

    * I would also have a look at the kernel log; if it’s a kernel-generated
    signal then there’s usually a log message about it.

    * Run the application under valgrind; depending what the issue is, that
    will provide a backtrace and perhaps more detailed information. If it
    is a memory corruption issue then it may identify where the corruption
    happens, rather than the later point where malloc failed a consistency
    check (or whatever it is).

    Using valgrind (and/or compiler sanitizer features) is a good idea even
    before running into trouble, really.

    --
    https://www.greenend.org.uk/rjk/

  • From The Natural Philosopher@3:770/3 to Ahem A Rivet's Shot on Thu Sep 14 12:27:52 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 08:52, Ahem A Rivet's Shot wrote:
    On Thu, 14 Sep 2023 07:57:45 +0100
    The Natural Philosopher <tnp@invalid.invalid> wrote:

    However, I think that for small operations one would have to posit a time
    between fopen() and fread() in which the file 'disappears' in some
    sense. But I *thought* that a file handle once issued would not point
    to empty data, and that fopen("w") would in fact create a new
    file and the old would not get unlinked until it was fclose()d.

    Nope - from man fopen

    “w” Open for writing. The stream is positioned at the beginning of
    the file. Truncate the file to zero length if it exists or
    create the file if it does not exist.


    Ok, so there is a finite chance that an empty (zero length) file might
    be read.
    That is worth checking.

    --
    "A point of view can be a dangerous luxury when substituted for insight
    and understanding".

    Marshall McLuhan

  • From The Natural Philosopher@3:770/3 to Richard Kettlewell on Thu Sep 14 12:54:38 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 09:23, Richard Kettlewell wrote:
    Theo <theom+news@chiark.greenend.org.uk> writes:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Main process exited, code=killed, status=6/ABRT
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Failed with result 'signal'.
    Sep 13 11:26:36 heating-controller systemd[1]: relayd.service: Consumed 15.074s CPU time.

    I rebooted it, and after a while - about ten minutes - it happened again.
    That is the above trace.

    I restarted it manually, and it hasn't crashed since.

    The web is flooded with instances of this message, all on different
    platforms and applications, and it would appear this is a very generic
    message, possibly to do with memory issues.

    You're getting SIGABRT which is typically something bailing due to memory
    corruption, eg corrupting metadata so that malloc can't work, or a
    double-free.

    I would compile it with debugging enabled: '-g' or '-ggdb' flag to your
    compiler. Then run it under gdb:

    $ gdb ./myprog
    (gdb) run

    and see if it dies. If it does you can get a backtrace to indicate where
    the fault occurred:

    (gdb) bt

    It may be that starting it under systemd is different in some way that it
    doesn't show up when running it by hand. You could try setting as your
    systemd command:

    gdb -ex run -ex bt --args /usr/local/bin/myprog arg1 arg2

    which will run it and then dump a backtrace when it's finished. You may get
    'no stack' if it succeeded and didn't record one.

    Also:

    * I would also have a look at the kernel log; if it’s a kernel-generated
    signal then there’s usually a log message about it.

    Nothing in kern.log after the boot process finishes.

    * Run the application under valgrind; depending what the issue is, that
    will provide a backtrace and perhaps more detailed information. If it
    is a memory corruption issue then it may identify where the corruption
    happens, rather than the later point where malloc failed a consistency
    check (or whatever it is).

    Using valgrind (and/or compiler sanitizer features) is a good idea even before running into trouble, really.

    The strange thing is that it failed once after a minute, then I rebooted
    and it failed after 20 minutes, and it's been running several days now
    with no issues at all.

    I am not sure valgrind would actually help unless it failed.
    --
    No Apple devices were knowingly used in the preparation of this post.

  • From candycanearter07@3:770/3 to Theo on Thu Sep 14 07:47:30 2023
    XPost: comp.os.linux.misc

    On 9/14/23 02:36, Theo wrote:
    You're getting SIGABRT which is typically something bailing due to memory corruption, eg corrupting metadata so that malloc can't work, or a double-free.

    I would compile it with debugging enabled: '-g' or '-ggdb' flag to your compiler. Then run it under gdb:

    $ gdb ./myprog
    (gdb) run

    and see if it dies. If it does you can get a backtrace to indicate where
    the fault occurred:

    (gdb) bt

    If you have coredumps enabled, you could also do coredumpctl debug to
    enter a gdb session of the last coredump that happened.

    --
    --
    user <candycane> is generated from /dev/urandom

  • From Theo@3:770/3 to The Natural Philosopher on Thu Sep 14 14:59:40 2023
    XPost: comp.os.linux.misc

    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    The strange thing is that it failed once after a minute, then I rebooted
    and it failed after 20 minutes, and it's been running several days now
    with no issues at all.

    I am not sure valgrind would actually help unless it failed.

    valgrind will tell you if it spots memory corruption, even if the corruption
    is not yet enough to cause it to crash. It may help in making the problem clearer and deterministic where the corruption makes it unpredictable.

    Theo

  • From The Natural Philosopher@3:770/3 to Theo on Thu Sep 14 16:25:14 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 14:59, Theo wrote:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    The strange thing is that it failed once after a minute, then I rebooted
    and it failed after 20 minutes, and it's been running several days now
    with no issues at all.

    I am not sure valgrind would actually help unless it failed.

    valgrind will tell you if it spots memory corruption, even if the corruption is not yet enough to cause it to crash. It may help in making the problem clearer and deterministic where the corruption makes it unpredictable.

    Theo

    I am wondering if the real reason is, that I trod on it. It is so
    utterly random that I am thinking that there may be a hardware issue
    like a cracked board. I wrecked the USB power socket for sure.

    Well a new untrodden on Pi is not the bank breaker that it might be....

    Thanks for all the helpful comments, but I am not ready to delve into
    reams of stack traces just yet.

    I think watch and see and then maybe try another board.


    --
    When plunder becomes a way of life for a group of men in a society, over
    the course of time they create for themselves a legal system that
    authorizes it and a moral code that glorifies it.

    Frédéric Bastiat

  • From The Natural Philosopher@3:770/3 to Ralf Fassel on Thu Sep 14 16:35:30 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 16:29, Ralf Fassel wrote:
    * The Natural Philosopher <tnp@invalid.invalid>
    | One possibility is that it is opening and reading a file at the
    | precise time another process is writing it...in both cases the read
    | and write
    | operations are atomic and done with C code.

    | READ
    | ====
    | fp=fopen(fullname, "r");
    | len=fread(filbuf,1,255,fp); // read entire file

    Check for fp != NULL is missing here in this example code before
    fread(). If this is also the case in the production version, it might be
    a problem if the file is not accessible for any reason.

    R'
    Ralf, I already put that in this morning, recompiled the code and after
    an hour, it crashed again.

    The filename is built by scanning a directory, so the filename must exist.

    The code runs as root, so there are no perms issues.

    I've put in checks to avoid trying to read empty files.

    I am leaning towards possibly a cracked solder joint or board.

    --
    The New Left are the people they warned you about.

  • From nev young@3:770/3 to The Natural Philosopher on Thu Sep 14 17:16:22 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 06:23, The Natural Philosopher wrote:
    I don't expect people to know the answer, but I could use some help in puzzling out where to look.

    One possibility is that it is opening and reading a file at the precise
    time another process is writing it...in both cases the read and write operations are atomic and done with C code.

    READ
    ====
    fp=fopen(fullname, "r");
    len=fread(filbuf,1,255,fp); // read entire file

    Elsewhere in this thread it is suggested checking fp != NULL.
    Not knowing what the actual program is doing, might I suggest also
    closing fp after it has been read.


    WRITE
    =====
    fp=fopen(filename, "w");
    if (fp)
        {
        fprintf(fp,"%s%s\n",filedata,timestamp);
        fclose(fp);
        }


    --
    Nev
    It causes me a great deal of regret and remorse
    that so many people are unable to understand what I write.

  • From Ralf Fassel@3:770/3 to All on Thu Sep 14 17:29:46 2023
    XPost: comp.os.linux.misc

    * The Natural Philosopher <tnp@invalid.invalid>
    | One possibility is that it is opening and reading a file at the
    | precise time another process is writing it...in both cases the read
    | and write
    | operations are atomic and done with C code.

    | READ
    | ====
    | fp=fopen(fullname, "r");
    | len=fread(filbuf,1,255,fp); // read entire file

    Check for fp != NULL is missing here in this example code before
    fread(). If this is also the case in the production version, it might be
    a problem if the file is not accessible for any reason.

    R'

  • From David W. Hodgins@3:770/3 to The Natural Philosopher on Thu Sep 14 13:44:16 2023
    XPost: comp.os.linux.misc

    On Thu, 14 Sep 2023 11:35:30 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

    On 14/09/2023 16:29, Ralf Fassel wrote:
    * The Natural Philosopher <tnp@invalid.invalid>
    | One possibility is that it is opening and reading a file at the
    | precise time another process is writing it...in both cases the read
    | and write
    | operations are atomic and done with C code.

    | READ
    | ====
    | fp=fopen(fullname, "r");
    | len=fread(filbuf,1,255,fp); // read entire file

    Check for fp != NULL is missing here in this example code before
    fread(). If this is also the case in the production version, it might be
    a problem if the file is not accessible for any reason.

    R'
    Ralf, I already put that in this morning, recompiled the code and after
    an hour, it crashed again.

    The filename is built by scanning a directory so the filename must exist.

    The code runs as root, so there are no perms issues

    I've put in checks to avoid trying to read empty files

    I am leaning towards possibly a cracked solder joint or board.

    Have you run fsck on the file system since the power loss? Make sure the fstab entry does not have a zero in the sixth field for the file system(s) in use.
    If using systemd, run dracut -f after any fstab changes. Then reboot.

    Regards, Dave Hodgins

  • From The Natural Philosopher@3:770/3 to David W. Hodgins on Thu Sep 14 19:42:26 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 18:44, David W. Hodgins wrote:
    On Thu, 14 Sep 2023 11:35:30 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

    On 14/09/2023 16:29, Ralf Fassel wrote:
    * The Natural Philosopher <tnp@invalid.invalid>
    | One possibility is that it is opening and reading a file at the
    | precise time another process is writing it...in both cases the read
    | and write
    | operations are atomic and done with C code.

    | READ
    | ====
    | fp=fopen(fullname, "r");
    | len=fread(filbuf,1,255,fp); // read entire file

    Check for fp != NULL is missing here in this example code before
    fread(). If this is also the case in the production version, it might be
    a problem if the file is not accessible for any reason.

    R'
    Ralf, I already put that in this morning, recompiled the code and after
    an hour, it crashed again.

    The filename is built by scanning a directory so the filename must exist.

    The code runs as root, so there are no perms issues

    I've put in checks to avoid trying to read empty files

    I am leaning towards possibly a cracked solder joint or board.

    Have you run fsck on the file system since the power loss? Make sure the fstab
    entry does not have a zero in the sixth field for the file system(s) in
    use.
    If using systemd, run dracut -f after any fstab changes. Then reboot.

    Regards, Dave Hodgins

    I assumed that the thing would have done its own fsck on every boot
    anyway... isn't that a Debian default?

    (The sixth fields are 2 and 1 respectively for the file systems)


    PARTUUID=b8c9fbb7-01  /boot           vfat    defaults          0       2
    PARTUUID=b8c9fbb7-02  /               ext4    defaults,noatime  0       1

    --
    Canada is all right really, though not for the whole weekend.

    "Saki"

  • From David W. Hodgins@3:770/3 to The Natural Philosopher on Thu Sep 14 14:53:20 2023
    XPost: comp.os.linux.misc

    On Thu, 14 Sep 2023 14:42:27 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

    On 14/09/2023 18:44, David W. Hodgins wrote:
    On Thu, 14 Sep 2023 11:35:30 -0400, The Natural Philosopher
    <tnp@invalid.invalid> wrote:

    On 14/09/2023 16:29, Ralf Fassel wrote:
    * The Natural Philosopher <tnp@invalid.invalid>
    | One possibility is that it is opening and reading a file at the
    | precise time another process is writing it...in both cases the read
    | and write
    | operations are atomic and done with C code.

    | READ
    | ====
    | fp=fopen(fullname, "r");
    | len=fread(filbuf,1,255,fp); // read entire file

    Check for fp != NULL is missing here in this example code before
    fread(). If this is also the case in the production version, it might be
    a problem if the file is not accessible for any reason.

    R'
    Ralf, I already put that in this morning, recompiled the code and after
    an hour, it crashed again.

    The filename is built by scanning a directory so the filename must exist.

    The code runs as root, so there are no perms issues

    I've put in checks to avoid trying to read empty files

    I am leaning towards possibly a cracked solder joint or board.

    Have you run fsck on the file system since the power loss? Make sure the
    fstab
    entry does not have a zero in the sixth field for the file system(s) in
    use.
    If using systemd, run dracut -f after any fstab changes. Then reboot.

    Regards, Dave Hodgins

    I assumed that the thing would have done its own fsck on every boot
    anyway... isn't that a Debian default?

    (The sixth fields are 2 and 1 respectively for the file systems)


    PARTUUID=b8c9fbb7-01  /boot           vfat    defaults          0       2
    PARTUUID=b8c9fbb7-02  /               ext4    defaults,noatime  0       1

    Does it use systemd? If so, confirm it was clean with
    "journalctl -b --no-h|grep fsck"

    Regards, Dave Hodgins

  • From The Natural Philosopher@3:770/3 to nev young on Thu Sep 14 19:38:02 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 17:16, nev young wrote:
    On 14/09/2023 06:23, The Natural Philosopher wrote:
    I don't expect people to know the answer, but I could use some help in
    puzzling out where to look.

    One possibility is that it is opening and reading a file at the
    precise time another process is writing it...in both cases the read
    and write operations are atomic and done with C code.

    READ
    ====
    fp=fopen(fullname, "r");
    len=fread(filbuf,1,255,fp); // read entire file

    Elsewhere in this thread it is suggested checking fp!=nul.
    Not knowing what the actual program is doing might I suggest also
    closing fp after it has been read.

    Both already done. Not closing it was the cause of a memory leak, but I
    fixed that a fortnight ago.

    I am beginning to wonder if I did more damage than just the power socket
    when I trod on it.


    WRITE
    =====
    fp=fopen(filename, "w");
    if (fp)
         {
         fprintf(fp,"%s%s\n",filedata,timestamp);
         fclose(fp);
         }



    --
    Canada is all right really, though not for the whole weekend.

    "Saki"

  • From The Natural Philosopher@3:770/3 to David W. Hodgins on Thu Sep 14 19:57:36 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 19:53, David W. Hodgins wrote:
    journalctl -b --no-h|grep fsck

    Sep 14 14:17:03 systemd[1]: Created slice system-systemd\x2dfsck.slice.
    Sep 14 14:17:03 systemd[1]: Listening on fsck to fsckd communication Socket.
    Sep 14 14:17:04 systemd-fsck[109]: e2fsck 1.46.2 (28-Feb-2021)
    Sep 14 14:17:04 systemd-fsck[109]: rootfs: clean, 51075/932256 files, 460111/3822976 blocks
    Sep 14 14:17:14 systemd-fsck[178]: fsck.fat 4.2 (2021-01-31)
    Sep 14 14:17:14 systemd-fsck[178]: There are differences between boot sector and its backup.
    Sep 14 14:17:14 systemd-fsck[178]: This is mostly harmless. Differences: (offset:original/backup)
    Sep 14 14:17:14 systemd-fsck[178]: 65:01/00
    Sep 14 14:17:14 systemd-fsck[178]: Not automatically fixing this.
    Sep 14 14:17:14 systemd-fsck[178]: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
    Sep 14 14:17:14 systemd-fsck[178]: Automatically removing dirty bit.
    Sep 14 14:17:14 systemd-fsck[178]: *** Filesystem was changed ***
    Sep 14 14:17:14 systemd-fsck[178]: Writing changes.
    Sep 14 14:17:14 systemd-fsck[178]: /dev/mmcblk0p1: 330 files, 25815/130554 clusters
    Sep 14 14:30:12 systemd[1]: systemd-fsckd.service: Succeeded.

    --
    “Those who can make you believe absurdities, can make you commit atrocities.”

    ― Voltaire, Questions sur les Miracles à M. Claparede, Professeur de Théologie à Genève, par un Proposant: Ou Extrait de Diverses Lettres de
    M. de Voltaire

  • From candycanearter07@3:770/3 to The Natural Philosopher on Thu Sep 14 14:40:56 2023
    XPost: comp.os.linux.misc

    On 9/14/23 13:42, The Natural Philosopher wrote:
    I assumed that the thing would have done its own fsck on every boot anyway...isn't that a Debian default?

    Pretty sure it's a standard, my arch install has it set.

    (The sixth fields are 2 and 1 respectively for the file systems)


    PARTUUID=b8c9fbb7-01  /boot           vfat    defaults          0       2
    PARTUUID=b8c9fbb7-02  /               ext4    defaults,noatime  0       1


    1 is fsck check for the root partition and 2 is for others, right?

    --
    --
    user <candycane> is generated from /dev/urandom

  • From David W. Hodgins@3:770/3 to The Natural Philosopher on Thu Sep 14 15:57:08 2023
    XPost: comp.os.linux.misc

    On Thu, 14 Sep 2023 14:57:36 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

    On 14/09/2023 19:53, David W. Hodgins wrote:
    journalctl -b --no-h|grep fsck

    Sep 14 14:17:03 systemd[1]: Created slice system-systemd\x2dfsck.slice.
    Sep 14 14:17:03 systemd[1]: Listening on fsck to fsckd communication Socket.
    Sep 14 14:17:04 systemd-fsck[109]: e2fsck 1.46.2 (28-Feb-2021)
    Sep 14 14:17:04 systemd-fsck[109]: rootfs: clean, 51075/932256 files, 460111/3822976 blocks
    Sep 14 14:17:14 systemd-fsck[178]: fsck.fat 4.2 (2021-01-31)
    Sep 14 14:17:14 systemd-fsck[178]: There are differences between boot
    sector and its backup.
    Sep 14 14:17:14 systemd-fsck[178]: This is mostly harmless. Differences: (offset:original/backup)
    Sep 14 14:17:14 systemd-fsck[178]: 65:01/00
    Sep 14 14:17:14 systemd-fsck[178]: Not automatically fixing this.
    Sep 14 14:17:14 systemd-fsck[178]: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
    Sep 14 14:17:14 systemd-fsck[178]: Automatically removing dirty bit.
    Sep 14 14:17:14 systemd-fsck[178]: *** Filesystem was changed ***
    Sep 14 14:17:14 systemd-fsck[178]: Writing changes.
    Sep 14 14:17:14 systemd-fsck[178]: /dev/mmcblk0p1: 330 files,
    25815/130554 clusters
    Sep 14 14:30:12 systemd[1]: systemd-fsckd.service: Succeeded.

    If there are any corrupted files, diagnosing any problems they cause will be difficult. I strongly recommend re-installing.

    Regards, Dave Hodgins

  • From Theo@3:770/3 to The Natural Philosopher on Thu Sep 14 21:51:28 2023
    XPost: comp.os.linux.misc

    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    both already done. Not closing it was the cause of a memory leak but I
    fixed that a fortnight ago.

    I am beginning to wonder if I did more damage than just the power socket
    when I trod on it.

    SIGABRT is a problem in your code. If you aren't seeing stuff in the kernel log then it almost certainly isn't a hardware fault. It is a very special skill to have a hardware fault without spewing lots of stuff there.

    Post the code somewhere and someone can take a look. Otherwise you need to
    use the development tools available to you to debug the problem.
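
As a rough illustration of why SIGABRT points at the process itself: the status=6/ABRT that systemd reported means the daemon died from signal 6, which is normally raised from inside the program by abort(), a failed assert(), or a consistency check in the C library. A minimal sketch, assuming a POSIX system; give_up() and signal_that_killed() are hypothetical names for illustration and not part of relayd:

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a library consistency check that gives up. */
void give_up(void)
{
    abort();                      /* raises SIGABRT in the calling process */
}

/* Run fn in a child process and return the signal that killed it,
   0 if it exited normally, or -1 if fork() or waitpid() failed. */
int signal_that_killed(void (*fn)(void))
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        fn();
        _exit(0);                 /* only reached if fn returns */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling signal_that_killed(give_up) should report SIGABRT, the same signal systemd logged, with no kernel involvement needed to produce it.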

    Theo

  • From Robert Riches@3:770/3 to The Natural Philosopher on Fri Sep 15 00:40:18 2023
    On 2023-09-14, The Natural Philosopher <tnp@invalid.invalid> wrote:
    On 14/09/2023 16:29, Ralf Fassel wrote:
    * The Natural Philosopher <tnp@invalid.invalid>
    | One possibility is that it is opening and reading a file at the
    | precise time another process is writing it...in both cases the read
    | and write
    | operations are atomic and done with C code.

    | READ
    | ====
    | fp=fopen(fullname, "r");
    | len=fread(filbuf,1,255,fp); // read entire file

    Check for fp != NULL is missing here in this example code before
    fread(). If this also in the production version, it might be a problem
    if the file is not accessible for any reason.

    R'
    Ralf, I already put that in this morning, re compiled the code and after
    an hour, it crashed again.

    The filename is built by scanning a directory so the filename must exist.

    Maybe not applicable in this situation, but if something deleted
    the file between the time of the scan and the time of the fopen
    call, it might/would not exist.

    --
    Robert Riches
    spamtrap42@jacob21819.net
    (Yes, that is one of my email addresses.)

  • From Richard Kettlewell@3:770/3 to David W. Hodgins on Fri Sep 15 08:20:54 2023
    XPost: comp.os.linux.misc

    "David W. Hodgins" <dwhodgins@nomail.afraid.org> writes:
    The Natural Philosopher <tnp@invalid.invalid> wrote:
    I am leaning towards possibly a cracked solder joint or board.

    Again, I agree with Theo. Reported behavior is not really consistent
    with a hardware fault.

    Have you run fsck on the file system since the power loss? Make sure the fstab
    entry does not have a zero in the sixth field for the file system(s) in use. If using systemd, run dracut -f after any fstab changes. Then reboot.

    Reported behavior is also not consistent with a corrupt filesystem.

    --
    https://www.greenend.org.uk/rjk/

  • From Richard Kettlewell@3:770/3 to The Natural Philosopher on Fri Sep 15 08:30:24 2023
    XPost: comp.os.linux.misc

    The Natural Philosopher <tnp@invalid.invalid> writes:
    On 14/09/2023 09:23, Richard Kettlewell wrote:
    Also:
    * I would also have a look at the kernel log; if it’s a
    kernel-generated signal then there’s usually a log message about it.

    Nothing in kern.log after the boot process finishes.

    Most likely a bug in your program then.

    * Run the application under valgrind; depending what the issue is, that
    will provide a backtrace and perhaps more detailed information. If it
    is a memory corruption issue then it may identify where the corruption
    happens, rather than the later point where malloc failed a consistency
    check (or whatever it is).

    Using valgrind (and/or compiler sanitizer features) is a good idea
    even before running into trouble, really.

    The strange thing is that it failed once after a minute, then I
    rebooted and it failed after 20 minutes, and its been running several
    days now with no issues at all.

    I am not sure valgrind would actually help unless it failed.

    It’s extremely good at identifying memory corruption even in cases where
    that doesn’t immediately lead to a crash; that’s what it’s for. But if it doesn’t, you leave it running until the crash happens.

    Up to you, of course, whether you use the tools available, or debug with
    one hand tied behind your back.

    --
    https://www.greenend.org.uk/rjk/

  • From Ralf Fassel@3:770/3 to All on Fri Sep 15 11:11:00 2023
    XPost: comp.os.linux.misc

    * The Natural Philosopher <tnp@invalid.invalid>
    | On 14/09/2023 16:29, Ralf Fassel wrote:
    | > * The Natural Philosopher <tnp@invalid.invalid>
    | > | One possibility is that it is opening and reading a file at the
    | > | precise time another process is writing it...in both cases the read
    | > | and write
    | > | operations are atomic and done with C code.
    | >>
    | > | READ
    | > | ====
    | > | fp=fopen(fullname, "r");
    | > | len=fread(filbuf,1,255,fp); // read entire file
    | > Check for fp != NULL is missing here in this example code before
    | > fread(). If this also in the production version, it might be a problem
    | > if the file is not accessible for any reason.
    | > R'
    | Ralf, I already put that in this morning, re compiled the code and
    | after an hour, it crashed again.

    | The filename is built by scanning a directory so the filename must exist.

    That assumption does not hold. Since scanning and opening are separated
    by a time gap (albeit a 'small' one), there is a non-zero chance that
    the file vanished between scan and open.

    Further possibilities:
    - how is 'filbuf' used after the fread()? If you use it as C-string, make
    sure it is 0-terminated (fread() won't do that for you). Maybe use
    fgets(3) instead?
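
The two points above (the file may vanish between scan and open, and fread() does not terminate the buffer) can be sketched as one small helper. This is only an illustration under assumptions: read_small_file() is a hypothetical name, not a function from the daemon.

```c
#include <stdio.h>

/* Read at most bufsize-1 bytes of path into buf and NUL-terminate it.
   Returns the number of bytes read, or -1 if the file could not be
   opened (for example because it vanished after the directory scan). */
long read_small_file(const char *path, char *buf, size_t bufsize)
{
    FILE *fp;
    size_t len;

    if (bufsize == 0)
        return -1;
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    len = fread(buf, 1, bufsize - 1, fp);
    buf[len] = '\0';              /* fread() does not terminate the buffer */
    fclose(fp);
    return (long)len;
}
```

Passing sizeof buf rather than a literal count keeps the read size and the buffer size from drifting apart.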

    | I am leaning towards possibly a cracked solder joint or board.

    Well, since the Raspi is cheap, that should be easily checked by simply
    using another one. I bet 1 beer that it is *not* a cracked board, since
    with that many more processes should run into trouble, not only this
    particular one.

    R' (.sig not from me .-)
    --
    echo '[ bottles of beer]sa[ bottle of beer]sb[ take one down, pass it around ]sd[ on the wall]sc[no more]se99snlc[lalnpsnPplalnp1-snpldPln1=ylnpsnPp[]pst ln0<x]sx[salblnpsnPplblnpsnpldPleplaPlcpq]sylxx' | dc

  • From The Natural Philosopher@3:770/3 to David W. Hodgins on Fri Sep 15 10:15:40 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 20:57, David W. Hodgins wrote:
    On Thu, 14 Sep 2023 14:57:36 -0400, The Natural Philosopher <tnp@invalid.invalid> wrote:

    On 14/09/2023 19:53, David W. Hodgins wrote:
    journalctl -b --no-h|grep fsck

    Sep 14 14:17:03 systemd[1]: Created slice system-systemd\x2dfsck.slice.
    Sep 14 14:17:03 systemd[1]: Listening on fsck to fsckd communication
    Socket.
    Sep 14 14:17:04 systemd-fsck[109]: e2fsck 1.46.2 (28-Feb-2021)
    Sep 14 14:17:04 systemd-fsck[109]: rootfs: clean, 51075/932256 files,
    460111/3822976 blocks
    Sep 14 14:17:14 systemd-fsck[178]: fsck.fat 4.2 (2021-01-31)
    Sep 14 14:17:14 systemd-fsck[178]: There are differences between boot
    sector and its backup.
    Sep 14 14:17:14 systemd-fsck[178]: This is mostly harmless. Differences:
    (offset:original/backup)
    Sep 14 14:17:14 systemd-fsck[178]:   65:01/00
    Sep 14 14:17:14 systemd-fsck[178]:   Not automatically fixing this.
    Sep 14 14:17:14 systemd-fsck[178]: Dirty bit is set. Fs was not properly
    unmounted and some data may be corrupt.
    Sep 14 14:17:14 systemd-fsck[178]:  Automatically removing dirty bit.
    Sep 14 14:17:14 systemd-fsck[178]: *** Filesystem was changed ***
    Sep 14 14:17:14 systemd-fsck[178]: Writing changes.
    Sep 14 14:17:14 systemd-fsck[178]: /dev/mmcblk0p1: 330 files,
    25815/130554 clusters
    Sep 14 14:30:12 systemd[1]: systemd-fsckd.service: Succeeded.

    If there are any corrupted files, diagnosing any problems they cause
    will be
    difficult. I strongly recommend re-installing.

    Regards, Dave Hodgins

    If it persists I may do that, but now it has been rock steady for 20 hours.

    The actual code has been replaced because I recompiled it anyway, but
    the problem persisted after that.

    Then I twisted the board a bit, and now it hasn't failed since. No
    guarantees of course.

    Does anyone else remember Tracy Kidder's 'Soul of a New Machine'* where
    they had a wire wrapped backplane on the prototype and a strange
    intermittent bug? And the director came in, twisted the backplane and
    the bug instantly reappeared?

    One of the more curious 'bugs' I encountered was early in my software
    career, when code that I wrote suddenly went crazy, in a way in which
    the actual software as written could not possibly have caused. And only
    on one machine, equipped with a custom video capture card. We removed
    the card, but it made no difference.

    I then compared the code on the machine with the code as compiled. Two
    bytes were FFH.

    I burned a new floppy and transferred the code again, and the code ran correctly.

    Then we reinstalled the video card. The code ran correctly. Then we
    copied over the code again with the video card installed. The code again
    was corrupted.

    Then the hardware guys looked at the address decode in the video card.
    It was a mass of gates one after the other. The total delay was well out
    of spec. It dawned on us that what was happening was that the DMA
    controller on the floppy was using bus addresses that were being decoded
    by the card, and then the IO request came along to access the floppy and
    those addresses were still on the bus as far as the sluglike video card
    was concerned, so it grabbed the data bus and shoved FFH on it.

    Hardware is not perfect. That is the lesson. And chasing software when
    it's really hardware is a losing game.

    Anyway, I have in reserve all the great techniques suggested, but for
    now I am playing a wait and see game to see if any pattern emerges. My
    experience suggests that the same code running a loop in the same memory
    won't crash and burn unless there is a malloc/free mismatch, and that
    happens fairly quickly and shows in 'top'.

    This kind of weird utterly asynchronous behaviour is often hardware.
    And, since I trod on the bloody PCB, I may simply get another one and
    test that. It doesn't need to be installed till winter. There is time.
    And my PCB design for the relay and PSU module isn't back from China yet...


    *https://en.wikipedia.org/wiki/The_Soul_of_a_New_Machine . Definitely recommended if you haven't read it.



    --
    "When one man dies it's a tragedy. When thousands die it's statistics."

    Josef Stalin

  • From The Natural Philosopher@3:770/3 to All on Fri Sep 15 10:16:34 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 20:40, candycanearter07 wrote:
    On 9/14/23 13:42, The Natural Philosopher wrote:
    I assumed that the thing would have done its own fsck on every boot
    anyway...isn't that a Debian default?

    Pretty sure it's a standard, my arch install has it set.

    (The sixth fields are 2 and 1 respectively for the file systems)


    PARTUUID=b8c9fbb7-01  /boot           vfat    defaults          0       2
    PARTUUID=b8c9fbb7-02  /               ext4    defaults,noatime  0       1


    1 is fsck check for the root partition and 2 is for others, right

    I looked it up, it merely specifies the order I think, so you are right
    in practice.


    --
    "Corbyn talks about equality, justice, opportunity, health care, peace, community, compassion, investment, security, housing...."
    "What kind of person is not interested in those things?"

    "Jeremy Corbyn?"

  • From The Natural Philosopher@3:770/3 to Robert Riches on Fri Sep 15 10:23:40 2023
    On 15/09/2023 01:40, Robert Riches wrote:
    On 2023-09-14, The Natural Philosopher <tnp@invalid.invalid> wrote:
    On 14/09/2023 16:29, Ralf Fassel wrote:
    * The Natural Philosopher <tnp@invalid.invalid>
    | One possibility is that it is opening and reading a file at the
    | precise time another process is writing it...in both cases the read
    | and write
    | operations are atomic and done with C code.

    | READ
    | ====
    | fp=fopen(fullname, "r");
    | len=fread(filbuf,1,255,fp); // read entire file

    Check for fp != NULL is missing here in this example code before
    fread(). If this also in the production version, it might be a problem
    if the file is not accessible for any reason.

    R'
    Ralf, I already put that in this morning, re compiled the code and after
    an hour, it crashed again.

    The filename is built by scanning a directory so the filename must exist.

    Maybe not applicable in this situation, but if something deleted
    the file between the time of the scan and the time of the fopen
    call, it might/would not exist.


    Exactly. That is a possibility, which I have now covered. It made no difference.

    In practice the write code that *replaces* the file is very simple. It is fopen( "w") immediately followed by
    fwrite()

    Without knowing the exact code involved with the fopen("w"), I can't say
    if that actually deletes the file and creates a new one, or merely
    truncates it to zero length, or indeed just opens it and trims the
    length *after* the new data is written.
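
For what it is worth, the C standard says fopen() with mode "w" truncates an existing file to zero length (or creates it), so a reader that opens the file between the truncation and the flush of the new data can see an empty or partial file. One common way around that, sketched here under assumptions, is to write a temporary file in the same directory and rename() it over the real name, which POSIX specifies as an atomic replacement. write_file_atomically() and the ".tmp" suffix are hypothetical choices for illustration, not what the daemon does:

```c
#include <stdio.h>

/* Write data to "<path>.tmp" and then rename it over path, so a reader
   opening path sees either the old contents or the new, never a
   truncated file. Returns 0 on success, -1 on failure. */
int write_file_atomically(const char *path, const char *data)
{
    char tmp[512];
    FILE *fp;
    int n;

    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                /* path too long for the temp name */

    fp = fopen(tmp, "w");
    if (fp == NULL)
        return -1;
    if (fputs(data, fp) == EOF) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {        /* flush errors surface here */
        remove(tmp);
        return -1;
    }
    if (rename(tmp, path) != 0) { /* atomic replacement on POSIX */
        remove(tmp);
        return -1;
    }
    return 0;
}
```

Because the temporary name ends in ".tmp" rather than ".dat", a directory scan that only accepts ".dat" files would skip it while it is being written.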



    --
    WOKE is an acronym... Without Originality, Knowledge or Education.

  • From The Natural Philosopher@3:770/3 to Theo on Fri Sep 15 10:27:16 2023
    XPost: comp.os.linux.misc

    On 14/09/2023 21:51, Theo wrote:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    both already done. Not closing it was the cause of a memory leak but I
    fixed that a fortnight ago.

    I am beginning to wonder if I did more damage than just the power socket
    when I trod on it.

    SIGABRT is a problem in your code.

    Very definite.

    Are you sure about that?

    If you aren't seeing stuff in the kernel
    log then it almost certainly isn't a hardware fault. It is a very special skill to have a hardware fault without spewing lots of stuff there.

    Even a corrupted bit in a ram disk?

    Post the code somewhere and someone can take a look. Otherwise you need to use the development tools available to you to debug the problem.


    I can post the code, but it may not help. You need the whole system,
    including the peripherals that write to the daemon that writes the data
    files that the daemon that crashes reads.

    At the moment it is behaving perfectly. Without a reproducible bug I can
    see no point in using a debugger.


    Theo

    --
    There is nothing a fleet of dispatchable nuclear power plants cannot do
    that cannot be done worse and more expensively and with higher carbon
    emissions and more adverse environmental impact by adding intermittent renewable energy.

  • From The Natural Philosopher@3:770/3 to Richard Kettlewell on Fri Sep 15 10:46:44 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 08:30, Richard Kettlewell wrote:
    The Natural Philosopher <tnp@invalid.invalid> writes:
    On 14/09/2023 09:23, Richard Kettlewell wrote:
    Also:
    * I would also have a look at the kernel log; if it’s a
    kernel-generated signal then there’s usually a log message about it.
    Nothing in kern.log after the boot process finishes.

    Most likely a bug in your program then.

    * Run the application under valgrind; depending what the issue is, that
    will provide a backtrace and perhaps more detailed information. If it
    is a memory corruption issue then it may identify where the corruption
    happens, rather than the later point where malloc failed a consistency
    check (or whatever it is).

    Using valgrind (and/or compiler sanitizer features) is a good idea
    even before running into trouble, really.

    The strange thing is that it failed once after a minute, then I
    rebooted and it failed after 20 minutes, and its been running several
    days now with no issues at all.

    I am not sure valgrind would actually help unless it failed.

    It’s extremely good at identifying memory corruption even in cases where that doesn’t immediately lead to a crash; that’s what it’s for. But if it doesn’t, you leave it running until the crash happens.

    Well that is an option for sure.

    Up to you, of course, whether you use the tools available, or debug with
    one hand tied behind your back.


    Tell me in what way a corrupted - say - libc file, or a faulty bit of
    memory would show up in the kernel logs?


    The problem is that this thing is looping very frequently.
    loop()
    {
    while (1)
        {
        int i;
        readThermometers();
        readZones();
        readOverrides();
        readTimerData();
        setRelayState();
        setRelays();
        usleep (1120000);
        }
    }

    And that means thousands of faultless iterations in a day.

    So this bug ( if it is a bug) is a one in a million or worse.

    I suppose I could make the thing loop ten times a second (or even
    faster) and see if it happens more often..

    It's not as though it's chewing up CPU...

    The problem I have is that these crashes only recently started
    happening: prior to that the code ran for days. And two things happened,
    a massive brownout, and then a full power cut, and I trod on it.

    And I made systemd start it...


    I see it crashed again last night, again with zero errors apart from
    SIGABRT...


    I will start it manually and cut systemd out.


    --
    The lifetime of any political organisation is about three years before
    its been subverted by the people it tried to warn you about.

    Anon.

  • From The Natural Philosopher@3:770/3 to Ralf Fassel on Fri Sep 15 11:19:10 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 10:11, Ralf Fassel wrote:
    * The Natural Philosopher <tnp@invalid.invalid>
    | On 14/09/2023 16:29, Ralf Fassel wrote:
    | > * The Natural Philosopher <tnp@invalid.invalid>
    | > | One possibility is that it is opening and reading a file at the
    | > | precise time another process is writing it...in both cases the read
    | > | and write
    | > | operations are atomic and done with C code.
    | >>
    | > | READ
    | > | ====
    | > | fp=fopen(fullname, "r");
    | > | len=fread(filbuf,1,255,fp); // read entire file
    | > Check for fp != NULL is missing here in this example code before
    | > fread(). If this also in the production version, it might be a problem
    | > if the file is not accessible for any reason.
    | > R'
    | Ralf, I already put that in this morning, re compiled the code and
    | after an hour, it crashed again.

    | The filename is built by scanning a directory so the filename must exist.

    That assumption does not hold. Since scanning and opening are separated
    by a time gap (albeit a 'small' one), there is a non-zero chance that
    the file vanished between scan and open.

    Further possibilities:
    - how is 'filbuf' used after the fread()? If you use it as C-string, make
    sure it is 0-terminated (fread() won't do that for you). Maybe use
    fgets(3) instead?


    dir = opendir(VOLATILE_DIR);

    if(!dir)
        return;
    while ((dp = readdir (dir)) != NULL)
        {
        filename=dp->d_name;
        // skip known bollocks
        if(!strcmp(filename, "." ) || !strcmp(filename, ".." ) || !strcmp(filename, "relays.dat" ))
            continue;
        // construct full path
        sprintf(fullname,"%s/%s",VOLATILE_DIR,filename);
        stat(fullname,&stats); // get file times
        if(time(NULL)-stats.st_ctime >1800) // skip files older than half an hour
            continue;
        len=strlen(filename);
        if(strncmp(filename+len-4, ".dat",4)) // .dat file but not relays.dat
            continue;
        fp=fopen(fullname, "r");
        if(fp==0) //file has disappeared?
            continue;
        len=fread(filbuf,1,255,fp);
        if(len==0) // file has zero length
            goto baddata;
        filbuf[len]=0;
        if(len=strncmp(filbuf,"ZONE",4)) // supposed to reject a file whose contents do not start with ZONE
            goto baddata;

        // looking very much like a temperature file
        i=(int)filbuf[4] -'1'; // this is our zone from "ZONE2" etc. 1-4 is zone but index is 0-3 so subtract '1'
        p=strstr(filbuf,"\n");
        if(p)
            {
            p++;
            if(q=strstr(p,"\n"))
                {
                *q++=0;
                thermometers[i].name=strdup(p); // make a copy of the name and attach it to our thermometer structure
                p=q;
                }
            else goto baddata;
            // now to fetch the temp data.
            if(q=strstr(p,"\n"))
                {
                *q++=0;
                thermometers[i].temp=atof(p);
                p=q;
                }
            else goto baddata;
            // what's left is the voltage. To hell with any crap after it
            thermometers[i].battery=atof(p);
            }
        baddata:fclose(fp);
        } // end of directory scan loop
    | I am leaning towards possibly a cracked solder joint or board.

    Well, since the Raspi is cheap, that should be easily checked by simply
    using another one. I bet 1 beer that it is *not* a cracked board, since
    with that many more processes should run into trouble, not only this particular one.

    R' (.sig not from me .-)

    --
    There is something fascinating about science. One gets such wholesale
    returns of conjecture out of such a trifling investment of fact.

    Mark Twain

  • From Theo@3:770/3 to The Natural Philosopher on Fri Sep 15 11:58:12 2023
    XPost: comp.os.linux.misc

    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    Tell me in what way a corrupted - say - libc file, or a faulty bit of
    memory would show up in the kernel logs?

    Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's possible to bit-flip in RAM or storage without it noticing. I don't know
    which part of the galaxy you inhabit, but cosmic rays are rare enough down
    here that random bit flips like this don't happen often - ballpark once a
    year for a server (which has a much greater surface area to absorb them than
    a Pi).

    It is also possible to be marginal on signal integrity for PCB interconnect, but that would mostly be a design fault: either they all work or many of
    them don't. Since we don't have a lot of people complaining of the same problem, we can assume the design is not marginal in that respect.

    If computers were that unreliable they would be failing all the time - and
    we'd fit ECC to everything. That they aren't suggests bit-flip corruption isn't a problem. In general random bit-flip errors are not a statistically major source of crashes, unless you're running a hyper-redundant mainframe
    and have eliminated all the other sources.

    Well-known classes of bugs, on the other hand, are concurrency/timing races and memory safety violations. That is odds-on what's happening here, especially given we've already picked up on potentially risky code like failing to check for NULL from fopen().

    And that means thousands of faultless iterations in a day.

    So this bug ( if it is a bug) is a one in a million or worse.

    I suppose I could make the thing loop ten times a second (or even
    faster) and see if it happens more often..

    That would be a useful thing to try.

    its not as though its chewing up CPU...

    The problem I have is that these crashes only recently started
    happening: prior to that the code ran for days. And two things happened,
    a massive brownout, and then a full power cut, and I trod on it.

    Most of those things would cause it to fail hard (ie not power up), rather
    than have a very rare random fault.

    And I made systemd start it...

    It is possible that being run from systemd changes the timing or environment that provokes the fault in some way, but I doubt it would be the cause of
    the fault.

    Theo

  • From Richard Kettlewell@3:770/3 to The Natural Philosopher on Fri Sep 15 11:58:08 2023
    XPost: comp.os.linux.misc

    The Natural Philosopher <tnp@invalid.invalid> writes:
    On 15/09/2023 08:30, Richard Kettlewell wrote:
    The Natural Philosopher <tnp@invalid.invalid> writes:
    I am not sure valgrind would actually help unless it failed.
    It’s extremely good at identifying memory corruption even in cases
    where that doesn’t immediately lead to a crash; that’s what it’s for.
    But if it doesn’t, you leave it running until the crash happens.

    Well that is an option for sure.

    Up to you, of course, whether you use the tools available, or debug with
    one hand tied behind your back.

    Tell me in what way a corrupted - say - libc file, or a faulty bit of
    memory would show up in the kernel logs?

    Very dependent on the nature of the corruption. But you’ve already told
    us there’s nothing in the kernel logs.

    Anyway, not responsible for advice not taken.

    --
    https://www.greenend.org.uk/rjk/

  • From Richard Kettlewell@3:770/3 to The Natural Philosopher on Fri Sep 15 12:07:48 2023
    XPost: comp.os.linux.misc

    The Natural Philosopher <tnp@invalid.invalid> writes:
    dir = opendir(VOLATILE_DIR);

    if(!dir)
    return;
    while ((dp = readdir (dir)) != NULL)
    {
    filename=dp->d_name;
    // skip known bollocks
    if(!strcmp(filename, "." ) || !strcmp(filename, ".." )
    || !strcmp(filename, "relays.dat" ))
    continue;
    // construct full path
    sprintf(fullname,"%s/%s",VOLATILE_DIR,filename);

    Possible write overrun here.

    stat(fullname,&stats);// get tfile times
    if(time(NULL)-stats.st_ctime >1800) // skip files older than half an hour
    continue;
    len=strlen(filename);
    if(strncmp(filename+len-4, ".dat",4)) // .dat file but not relays.dat
    continue;

    Possible read under-run here. (But if it crashes then you’d expect
    SIGSEGV rather than SIGABRT, so that’s probably not the issue.)

    fp=fopen(fullname, "r");
    if(fp==0) //file has disappeared?
    continue;
    len=fread(filbuf,1,255,fp);

    I don’t think the declaration of filbuf has been posted, so there’s a possible write overrun if it’s less than 255 bytes.
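
The two string-handling risks flagged above (sprintf() overrunning fullname, and filename+len-4 pointing before the start of a short name) can be sketched away as follows. build_path() and has_suffix() are hypothetical helpers, not code from the daemon. Separately, note that the posted code writes filbuf[len]=0 with len as large as 255, so filbuf would need to be at least 256 bytes.

```c
#include <stdio.h>
#include <string.h>

/* Build "dir/name" in out. Returns 0 on success, -1 if it would not fit. */
int build_path(char *out, size_t outsize, const char *dir, const char *name)
{
    int n = snprintf(out, outsize, "%s/%s", dir, name);

    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}

/* Return 1 if name ends with suffix, 0 otherwise. Names shorter than the
   suffix are rejected before any pointer arithmetic is done. */
int has_suffix(const char *name, const char *suffix)
{
    size_t nlen = strlen(name);
    size_t slen = strlen(suffix);

    if (nlen < slen)
        return 0;
    return strcmp(name + nlen - slen, suffix) == 0;
}
```

In the scan loop these would replace the sprintf() and strncmp() calls, skipping the directory entry whenever build_path() fails or has_suffix(filename, ".dat") is false.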


    --
    https://www.greenend.org.uk/rjk/

  • From Ralf Fassel@3:770/3 to All on Fri Sep 15 13:12:26 2023
    XPost: comp.os.linux.misc

    You trust the contents of 'outside'-files very much, do you? ;-)
    I don't know who can create files in the directory you're scanning, but
    not *assuring* the input you expect is another possible cause for
    problems...

    * The Natural Philosopher <tnp@invalid.invalid>
    | > Further possibilities:
    | > - how is 'filbuf' used after the fread()? If you use it as C-string, make
    | >   sure it is 0-terminated (fread() won't do that for you). Maybe use
    | > fgets(3) instead?
    | >
    | dir = opendir(VOLATILE_DIR);

    | if(!dir)
    | return;
    | while ((dp = readdir (dir)) != NULL)
    [looks good, error checks for stat() et al couldn't hurt]
    --<snip-snip>--
    | if(len=strncmp(filbuf,"ZONE",4)) //supposed to reject
    | a file whose contents do not start with ZONE
    | goto baddata;
    |
    | // looking very much like a temperature file
    | i=(int)filbuf[4] -'1'; // this is our zone from
    | "ZONE2" etc. 1-4 is zone but index is 0-3 so subtract
    | '1'

    The access of filbuf[4] is ok (since you checked that there are at least
    4 characters in the file), but what if nothing follows after the 'ZONE',
    or ZONE is followed by anything but [1-4]?
    Assert that 'i' is in the valid index range here, before using it as
    index into other arrays.
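
    A sketch of that range check, assuming the buffer is NUL-terminated before it is parsed (fread() does not do that by itself) and that NUMBER_RELAYS is 4. The function name parse_zone_index is made up for illustration:

```c
#include <assert.h>
#include <string.h>

#define NUMBER_RELAYS 4 /* assumed value; the thread only says "up to 4" */

/* Return the 0-based zone index for a NUL-terminated buffer that starts
 * with "ZONE1".."ZONE4", or -1 if the contents are anything else. */
static int parse_zone_index(const char *buf)
{
    assert(buf != NULL);
    if (strncmp(buf, "ZONE", 4) != 0)
        return -1;
    /* buf[4] is at worst the terminating NUL, which fails the range check. */
    int i = buf[4] - '1';
    if (i < 0 || i >= NUMBER_RELAYS)
        return -1;
    return i;
}
```

    A negative return would take the same path as the existing "goto baddata", so an empty or truncated file never produces an index into thermometers[].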

    | p=strstr(filbuf,"\n");
    | if(p)
    | {
    | p++;
    | if(q=strstr(p,"\n"))
    | {
    | *q++=0;
    | thermometers[i].name=strdup(p); //
    | make a copy of the name and attach it
    | to our thermometer structure

    Memory leak if thermometers[i].name already contains something.

    Other than that, I really would have it running under a debugger or
    valgrind, since then *if* it crashes, you *know* *where* in your code it crashes.

    Good luck hunting!
    R'

  • From Richard Kettlewell@3:770/3 to Theo on Fri Sep 15 12:12:58 2023
    XPost: comp.os.linux.misc

    Theo <theom+news@chiark.greenend.org.uk> writes:
    The Natural Philosopher <tnp@invalid.invalid> wrote:
    Tell me in what way a corrupted - say - libc file, or a faulty bit of
    memory would show up in the kernel logs?

    Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's possible to bit-flip in RAM or storage without it noticing. I don't know which part of the galaxy you inhabit, but cosmic rays are rare enough down here that random bit flips like this don't happen often - ballpark once a year for a server (which has a much greater surface area to absorb them than a Pi).

    I’ve seen one inarguable random bit flip in several decades. In that
    case the behavior was deterministic - chiark’s /bin/ls had got a
    single-bit error, and caching meant it crashed _every_ time anyone ran
    it.

    Maybe TNP has taken a trip to Sizewell?

    --
    https://www.greenend.org.uk/rjk/

  • From Pancho@3:770/3 to The Natural Philosopher on Fri Sep 15 12:19:44 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 10:46, The Natural Philosopher wrote:
    On 15/09/2023 08:30, Richard Kettlewell wrote:
    The Natural Philosopher <tnp@invalid.invalid> writes:
    On 14/09/2023 09:23, Richard Kettlewell wrote:
    Also:
    * I would also have a look at the kernel log; if it’s a
       kernel-generated signal then there’s usually a log message about it.

    Nothing in kern.log after the boot process finishes.

    Most likely a bug in your program then.

    * Run the application under valgrind; depending what the issue is, that
       will provide a backtrace and perhaps more detailed information. If it
       is a memory corruption issue then it may identify where the corruption
       happens, rather than the later point where malloc failed a consistency
       check (or whatever it is).

    Using valgrind (and/or compiler sanitizer features) is a good idea
    even before running into trouble, really.

    The strange thing is that it failed once after a minute, then I
    rebooted and it failed after 20 minutes, and its been running several
    days now with no issues at all.

    I am not sure valgrind would actually help unless it failed.

    It’s extremely good at identifying memory corruption even in cases where
    that doesn’t immediately lead to a crash; that’s what it’s for.  But if
    it doesn’t, you leave it running until the crash happens.

    Well that is an option for sure.


    Valgrind seems to be a modern version of Purify, which was absolutely essential, when I programmed C 30 years ago.

    Personally, I want to run with full debug, stack trace, logging,
    exception handling, and bounds checking turned on all the time, even in production. Which is why I generally use a modern language like C# or Java.

    I'm with you on Python being rubbish, but have you considered something
    like Rust? That gives you the benefit of a modern language, without
    Garbage Collection pauses (if you care), or the need for a runtime
    environment (like Python, C#, and Java).

    Even using C++ would give you exception handling. C++ won't force you
    to go too far if you don't want to.

  • From The Natural Philosopher@3:770/3 to Theo on Fri Sep 15 13:03:52 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 11:58, Theo wrote:
    Well-known classes of bugs are concurrency/timing races and memory safety violations, which is odds-on what's happening here, especially given we've already picked up on potentially risky code like failing to check for NULL from fopen().

    No, I do check it.


    --
    “It is dangerous to be right in matters on which the established
    authorities are wrong.”

    ― Voltaire, The Age of Louis XIV

  • From Pancho@3:770/3 to Theo on Fri Sep 15 12:24:32 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 11:58, Theo wrote:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    Tell me in what way a corrupted - say - libc file, or a faulty bit of
    memory would show up in the kernel logs?

    Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's possible to bit-flip in RAM or storage without it noticing. I don't know which part of the galaxy you inhabit, but cosmic rays are rare enough down here that random bit flips like this don't happen often - ballpark once a year for a server (which has a much greater surface area to absorb them than a Pi).

    Lol! I thought cosmic rays when I read this thread.

    Decades of having my nose rubbed in the shit of my own stupidity, I
    guess. :-)

  • From The Natural Philosopher@3:770/3 to Richard Kettlewell on Fri Sep 15 13:18:54 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 12:07, Richard Kettlewell wrote:
    The Natural Philosopher <tnp@invalid.invalid> writes:
    dir = opendir(VOLATILE_DIR);

    if(!dir)
    return;
    while ((dp = readdir (dir)) != NULL)
    {
    filename=dp->d_name;
    // skip known bollocks
    if(!strcmp(filename, "." ) || !strcmp(filename, ".." )
    || !strcmp(filename, "relays.dat" ))
    continue;
    // construct full path
    sprintf(fullname,"%s/%s",VOLATILE_DIR,filename);

    Possible write overrun here.
    The filenames never change length.


    stat(fullname,&stats);// get tfile times
    if(time(NULL)-stats.st_ctime >1800) // skip files older than half an hour
    continue;
    len=strlen(filename);
    if(strncmp(filename+len-4, ".dat",4)) // .dat file but not relays.dat
    continue;

    Possible read under-run here. (But if it crashes then you’d expect
    SIGSEGV rather than SIGABRT, so that’s probably not the issue.)

    fp=fopen(fullname, "r");
    if(fp==0) //file has disappeared?
    continue;
    len=fread(filbuf,1,255,fp);

    I don’t think the declaration of filbuf has been posted, so there’s a possible write overrun if it’s less than 255 bytes.


    char filbuf[256];
    char fullname[256];

    The fullname is of the form

    /var/www/data/volatile/192.168.0.xx.dat

    There are no other files apart from 'relay.dat' in that directory.

    I mean you are all throwing noob bugs at me. Yes, in 1984 that's the
    sort of shit I used to write. Not these days.

    I have a drawer full of T shirts marked 'buffer overrun' 'alloc without
    free' 'fopen without fclose'.

    The fact is the memory footprint does not increase. So there are no
    obvious or simple memory leaks.

    I've absolutely covered every error case mentioned here in the one case
    of the files that get written and read every few seconds.

    It occurs to me that this behaviour started when I made it autoboot
    under systemd as well.

    Since the consensus seems to be it isn't hardware, or file corruption, I
    am trying it launched manually to see if it crashes or not.

    Systemd does seem to wrap things in resource limits, and start with a
    slightly different ENV, although I can't see that any are being exceeded.

    If it wasn't a daemon I would expect it to segfault and show that on
    screen. I could run it without daemonising it as well.

    So lots of options to try.

    As well as soft debuggers.

    --
    “It is dangerous to be right in matters on which the established
    authorities are wrong.”

    ― Voltaire, The Age of Louis XIV

  • From The Natural Philosopher@3:770/3 to Richard Kettlewell on Fri Sep 15 13:06:04 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 12:12, Richard Kettlewell wrote:
    Theo <theom+news@chiark.greenend.org.uk> writes:
    The Natural Philosopher <tnp@invalid.invalid> wrote:
    Tell me in what way a corrupted - say - libc file, or a faulty bit of
    memory would show up in the kernel logs?

    Well, it could be a cosmic ray. The Pi doesn't have ECC memory so it's
    possible to bit-flip in RAM or storage without it noticing. I don't know
    which part of the galaxy you inhabit, but cosmic rays are rare enough down
    here that random bit flips like this don't happen often - ballpark once a
    year for a server (which has a much greater surface area to absorb them than
    a Pi).

    I’ve seen one inarguable random bit flip in several decades. In that
    case the behavior was deterministic - chiark’s /bin/ls had got a
    single-bit error, and caching meant it crashed _every_ time anyone ran
    it.

    Maybe TNP has taken a trip to Sizewell?


    LOL!

    Nope.

    I am trying some stuff out to try and get it to fail *consistently*.

    I don't feel it's hugely profitable to attempt to debug it when most of
    the time it's not doing anything wrong.

    --
    “It is dangerous to be right in matters on which the established
    authorities are wrong.”

    ― Voltaire, The Age of Louis XIV

  • From candycanearter07@3:770/3 to The Natural Philosopher on Fri Sep 15 07:53:26 2023
    XPost: comp.os.linux.misc

    On 9/15/23 04:16, The Natural Philosopher wrote:
    On 14/09/2023 20:40, candycanearter07 wrote:
    On 9/14/23 13:42, The Natural Philosopher wrote:
    I assumed that the thing would have done its own fsck on every boot
    anyway...isnt that a debian default?

    Pretty sure it's a standard, my arch install has it set.

    (The sixth fields are 2 and 1 respectively for the file systems)


    PARTUUID=b8c9fbb7-01  /boot           vfat    defaults
    0       2
    PARTUUID=b8c9fbb7-02  /               ext4    defaults,noatime
    0       1


    1 is fsck check for the root partition and 2 is for others, right

    I looked it up, it merely specifies the order I think, so you are right
    in practice.



    Oh, the thing I learned was that you should always put root as 1 and
    everything else as 2 ^^" but that makes more sense

    --
    --
    user <candycane> is generated from /dev/urandom

  • From The Natural Philosopher@3:770/3 to Ralf Fassel on Fri Sep 15 13:24:16 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 12:12, Ralf Fassel wrote:
    You trust the contents of 'outside'-files very much, do you? ;-)
    I don't know who can create files in the directory you're scanning, but
    not *assuring* the input you expect is another possible cause for
    problems...

    * The Natural Philosopher <tnp@invalid.invalid>
    | > Further possibilities:
    | > - how is 'filbuf' used after the fread()? If you use it as C-string, make
    | > sure it is 0-terminated (fread() won't do that for you). Maybe use
    | > fgets(3) instead?
    | >
    | dir = opendir(VOLATILE_DIR);

    | if(!dir)
    | return;
    | while ((dp = readdir (dir)) != NULL)
    [looks good, error checks for stat() et al couldn't hurt]
    --<snip-snip>--
    | if(len=strncmp(filbuf,"ZONE",4)) //supposed to reject
    | a file whose contents do not start with ZONE
    | goto baddata;
    |
    | // looking very much like a temperature file
    | i=(int)filbuf[4] -'1'; // this is our zone from
    | "ZONE2" etc. 1-4 is zone but index is 0-3 so subtract
    | '1'

    The access of filbuf[4] is ok (since you checked that there are at least
    4 characters in the file), but what if nothing follows after the 'ZONE',
    or ZONE is followed by anything but [1-4]?

    That cannot happen. It's hard-wired into the code that writes the file.

    Assert that 'i' is in the valid index range here, before using it as
    index into other arrays.

    | p=strstr(filbuf,"\n");
    | if(p)
    | {
    | p++;
    | if(q=strstr(p,"\n"))
    | {
    | *q++=0;
    | thermometers[i].name=strdup(p); //
    | make a copy of the name and attach it
    | to our thermometer structure

    Memory leak if thermometers[i].name already contains something.

    further up the line...

    bzero(filbuf,sizeof(filbuf));
    /** first thing to do is clean any allocated memory used to store values. **/
    for(i=0;i<NUMBER_RELAYS;i++)
    free(thermometers[i].name);

    Other than that, I really would have it running under a debugger or
    valgrind, since then *if* it crashes, you *know* *where* in your code it crashes.

    Last resort. I have to learn how to *use* those tools.
    Right now I am working on other stuff and am content to change one thing
    at a time to see if that makes any difference.

    That is a low user time strategy.


    Good luck hunting!
    R'

    Thank you. The input has been valuable. And I now have further
    strategies in reserve.

    As with all intermittent faults, the thing you need most is a reliable
    way to make the fault occur.


    --
    "The great thing about Glasgow is that if there's a nuclear attack it'll
    look exactly the same afterwards."

    Billy Connolly

  • From The Natural Philosopher@3:770/3 to Theo on Fri Sep 15 14:56:22 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 14:23, Theo wrote:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    On 15/09/2023 12:12, Ralf Fassel wrote:
    | {
    | *q++=0;
    | thermometers[i].name=strdup(p); //
    | make a copy of the name and attach it
    | to our thermometer structure

    Memory leak if thermometers[i].name already contains something.

    further up the line...

    bzero(filbuf,sizeof(filbuf));
    /** first thing to do is clean any allocated memory used to store values. **/
    for(i=0;i<NUMBER_RELAYS;i++)
    free(thermometers[i].name);

    You could get a SIGABRT if you were trying to free something that was
    already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there was some code path that was breaking out of the loop or similar you might
    get such behaviour.

    Hmm. I free the pointers even for relay zones that don't have
    thermometers, whose pointers are 0. That isn't an issue.

    But that might be a remotely possible issue. I don't zero the pointers
    after freeing them as far as I can tell. The silly thing is that this
    program doesn't use the name anyway.

    It's used elsewhere.
    Well, I don't think it's an issue, but I can zero the pointers anyway
    after free()ing.

    Theo

    --
    "I guess a rattlesnake ain't risponsible fer bein' a rattlesnake, but ah
    puts mah heel on um jess the same if'n I catches him around mah chillun".

  • From Theo@3:770/3 to The Natural Philosopher on Fri Sep 15 14:23:48 2023
    XPost: comp.os.linux.misc

    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    On 15/09/2023 12:12, Ralf Fassel wrote:
    | {
    | *q++=0;
    | thermometers[i].name=strdup(p); //
    | make a copy of the name and attach it
    | to our thermometer structure

    Memory leak if thermometers[i].name already contains something.

    further up the line...

    bzero(filbuf,sizeof(filbuf));
    /** first thing to do is clean any allocated memory used to store values. **/
    for(i=0;i<NUMBER_RELAYS;i++)
    free(thermometers[i].name);

    You could get a SIGABRT if you were trying to free something that was
    already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there
    was some code path that was breaking out of the loop or similar you might
    get such behaviour.

    Theo

  • From Ralf Fassel@3:770/3 to All on Fri Sep 15 16:12:56 2023
    XPost: comp.os.linux.misc

    * The Natural Philosopher <tnp@invalid.invalid>
    | > | if(len=strncmp(filbuf,"ZONE",4)) //supposed to reject
    | > | a file whose contents do not start with ZONE
    | > | goto baddata;
    | > |
    | > | // looking very much like a temperature file
    | > | i=(int)filbuf[4] -'1'; // this is our zone from
    | > | "ZONE2" etc. 1-4 is zone but index is 0-3 so subtract
    | > | '1'
    | > The access of filbuf[4] is ok (since you checked that there are at
    | > least
    | > 4 characters in the file), but what if nothing follows after the 'ZONE',
    | > or ZONE is followed by anything but [1-4]?

    | That cannot happen. It's hard-wired into the code that writes the file.

    Depending on the permissions of VOLATILE_DIR, it *might* be possible
    that *anybody* can drop files in there. Save some "// skip known
    bollocks", you just scan every file in VOLATILE_DIR. If I were an
    attacker, I sure would try to use that vector, regardless whether the
    program in question runs with elevated permissions or not ;-)

    | > Other than that, I really would have it running under a debugger or
    | > valgrind, since then *if* it crashes, you *know* *where* in your code it
    | > crashes.
    | >
    | Last resort. I have to learn how to *use* those tools.

    With valgrind, it is as easy as putting 'valgrind' in front of the
    commandline you use to start your program. With gdb, it is a tiny bit
    more complicated, agreed. But since these tools are worth learning
    anyway for any programmer, the time invested in learning them is not
    wasted.

    R'

  • From Ralf Fassel@3:770/3 to All on Fri Sep 15 16:27:46 2023
    XPost: comp.os.linux.misc

    * The Natural Philosopher <tnp@invalid.invalid>
    | > | thermometers[i].name=strdup(p); //
    | > | make a copy of the name and attach it
    | > | to our thermometer structure
    | > Memory leak if thermometers[i].name already contains something.
    | >
    | further up the line...

    | bzero(filbuf,sizeof(filbuf));
    | /** first thing to do is clean any allocated memory used to
    | store values. **/
    | for(i=0;i<NUMBER_RELAYS;i++)
    | free(thermometers[i].name);

    Note that the assignment

    thermometers[i].name=strdup(p);

    is *inside* the while() loop without a free().

    Probably you argue that there ever is only a single file to read in that
    dir anyway... Personally, I've been bitten by such assumptions, so I'd
    rather check once too often than hunting down "can't happen" bugs.

    R'

  • From The Natural Philosopher@3:770/3 to Theo on Fri Sep 15 15:32:44 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 14:23, Theo wrote:
    You could get a SIGABRT if you were trying to free something that was
    already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there was some code path that was breaking out of the loop or similar you might
    get such behaviour.

    Well, I am not sure if that was it or not, but I deleted manually a
    thermometer file and the thing crashed instantly. That is consistent
    with the name having been set once, and then repeatedly free()ed. I then installed the code with the free()ed pointers set to NULL, and it
    *didn't* crash instantly.


    I had assumed that freeing a pointer that already had been freed would
    either result in a NO-OP because the pointer no longer existed in the
    heap memory allocation tables, or it would instantly crash, but it
    seems that the action is 'undefined'.

    Not sure that's done the trick, because I don't quite see how a file
    could ever cease to exist.

    To not exist in the first place is one thing, but once written, nothing
    should delete them.

    Unless fopen("w") does that for a fraction of a microsecond

    Or fopen("w") creates an *empty* file, in which case it is *just*
    possible that an empty file is read, no strdup was done and the pointer
    was double freed...next time around.

    Academic now anyway. Pointers all set to null after freeing. Defined
    behaviour. frees on NULL ignored.
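
    A sketch of the free-then-NULL cleanup described above. The struct layout and the name clear_names are assumptions for illustration; only the name field and NUMBER_RELAYS appear in the posted code:

```c
#include <assert.h>
#include <stdlib.h>

#define NUMBER_RELAYS 4 /* assumed value */

struct thermometer {
    char *name; /* heap copy made with strdup(), or NULL when unset */
};

/* Release every stored name and reset the pointer, so that running the
 * cleanup again before a new strdup() only ever calls free(NULL), which
 * the C standard defines as a no-op. */
static void clear_names(struct thermometer *t, size_t count)
{
    assert(t != NULL);
    for (size_t i = 0; i < count; i++) {
        free(t[i].name);
        t[i].name = NULL;
    }
}
```

    With this in place, a scan that finds an empty or missing file simply leaves the pointer NULL, and the next cleanup pass has nothing stale to free.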

    I'll let it run and run and see.


    --
    The biggest threat to humanity comes from socialism, which has utterly
    diverted our attention away from what really matters to our existential survival, to indulging in navel gazing and faux moral investigations
    into what the world ought to be, whilst we fail utterly to deal with
    what it actually is.

  • From candycanearter07@3:770/3 to Theo on Fri Sep 15 09:40:00 2023
    XPost: comp.os.linux.misc

    On 9/15/23 08:23, Theo wrote:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid> wrote:
    On 15/09/2023 12:12, Ralf Fassel wrote:
    | {
    | *q++=0;
    | thermometers[i].name=strdup(p); //
    | make a copy of the name and attach it
    | to our thermometer structure

    Memory leak if thermometers[i].name already contains something.

    further up the line...

    bzero(filbuf,sizeof(filbuf));
    /** first thing to do is clean any allocated memory used to store values. **/
    for(i=0;i<NUMBER_RELAYS;i++)
    free(thermometers[i].name);

    You could get a SIGABRT if you were trying to free something that was
    already freed. Are you sure those are interlocked such that for each i you call strdup() exactly once, and subsequently free() exactly once? If there was some code path that was breaking out of the loop or similar you might
    get such behaviour.

    Theo

    I thought double free was a SIGSEGV?
    --
    --
    user <candycane> is generated from /dev/urandom

  • From The Natural Philosopher@3:770/3 to Ralf Fassel on Fri Sep 15 15:55:16 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 15:27, Ralf Fassel wrote:
    * The Natural Philosopher <tnp@invalid.invalid>
    | > | thermometers[i].name=strdup(p); //
    | > | make a copy of the name and attach it
    | > | to our thermometer structure
    | > Memory leak if thermometers[i].name already contains something.
    | >
    | further up the line...

    | bzero(filbuf,sizeof(filbuf));
    | /** first thing to do is clean any allocated memory used to
    | store values. **/
    | for(i=0;i<NUMBER_RELAYS;i++)
    | free(thermometers[i].name);

    Note that the assignment

    thermometers[i].name=strdup(p);

    is *inside* the while() loop without a free().

    Probably you argue that there ever is only a single file to read in that
    dir anyway... Personally, I've been bitten by such assumptions, so I'd rather check once too often than hunting down "can't happen" bugs.

    R'

    No, you have misunderstood how the code works.

    There are up to 4 (NUMBER_RELAYS) thermometer files in that dir, and all
    of them are read in the loop. What there shouldn't be is more than one
    file with a ZONE number the same. So no pointer gets more than one strdup().

    If there were, it might be possible to strdup the same pointer twice.
    And the daemon would get a memory leak and crash.

    (It would be trivial to simply add a conditional that only strdups to a
    pointer if it is NULL).

    That is a possibility that could be caused by mis-configuration of the thermometers themselves.

    However they are not at this time misconfigured, so it shouldn't be the
    crash problem, although it is an issue I will consider because fat
    fingers *could* cause it.

    I do think that what has happened is that a valid file name has been
    found with empty data, or no file at all, and then no strdup is done -
    but the free is, next time around.

    That should never happen of course, as the fopen/fwrite sequence should certainly not delete the filename, but it is entirely possible that
    the fopen *truncates* its data. At which point we can't strdup anything,
    so the next free gets a whoopsie.


    Setting the pointers to NULL after free() is nice defensive coding

    As is allocating memory only if the pointers are null.

    So both are in there now.
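
    A variant of the allocate-only-if-NULL guard, sketched under the assumption that a later file for the same zone should win: rather than skipping the strdup(), it releases the previous copy and stores the new one, so a duplicate ZONE number neither leaks nor leaves a dangling pointer. The helper name set_name is hypothetical:

```c
#define _POSIX_C_SOURCE 200809L /* for strdup() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Replace *slot with a fresh heap copy of value.
 * Returns 0 on success; on allocation failure returns -1 and leaves the
 * old value in place. */
static int set_name(char **slot, const char *value)
{
    assert(slot != NULL && value != NULL);
    char *copy = strdup(value);
    if (copy == NULL)
        return -1;
    free(*slot); /* free(NULL) is a no-op, so an unset slot is fine */
    *slot = copy;
    return 0;
}
```

    The call site would then be set_name(&thermometers[i].name, p) in place of the bare assignment.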


    --
    “Progress is precisely that which rules and regulations did not foresee,”

    – Ludwig von Mises

  • From Rich@3:770/3 to The Natural Philosopher on Fri Sep 15 15:00:24 2023
    XPost: comp.os.linux.misc

    In comp.os.linux.misc The Natural Philosopher <tnp@invalid.invalid> wrote:
    I had assumed that freeing a pointer that already had been freed would
    either result in a NO-OP because the pointer no longer existed in the
    heap memory allocation tables, or it would instantly crash, but it
    seems that the action is 'undefined'.

    Yes, C explicitly labels "double free" as "undefined":

    <http://port70.net/~nsz/c/c99/n1256.html#J.2>

    Look under J.2 Undefined behavior (easiest is to search for "free"):

    J.2 Undefined behavior

    1 The behavior is undefined in the following circumstances:

    ...

    The pointer argument to the free or realloc function does not match
    a pointer earlier returned by calloc, malloc, or realloc, or the
    space has been deallocated by a call to free or realloc (7.20.3.2,
    7.20.3.4).

    And the 7.20.3.2 link in the page jumps to this:

    The free function causes the space pointed to by ptr to be
    deallocated, that is, made available for further allocation. If
    ptr is a null pointer, no action occurs. Otherwise, if the
    argument does not match a pointer earlier returned by the calloc,
    malloc, or realloc function, or if the space has been deallocated
    by a call to free or realloc, the behavior is undefined.

    So if by chance you are double-freeing sometimes, then you are tickling
    the undefined behaviour devil, and all bets are off as to what might
    eventually occur.

  • From The Natural Philosopher@3:770/3 to All on Fri Sep 15 16:06:16 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 15:40, candycanearter07 wrote:
    On 9/15/23 08:23, Theo wrote:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid>
    wrote:
    On 15/09/2023 12:12, Ralf Fassel wrote:
    |                             {
    |                             *q++=0;
    |                             thermometers[i].name=strdup(p); //
    |                             make a copy of the name and attach it
    |                             to our thermometer structure

    Memory leak if thermometers[i].name already contains something.

    further up the line...

             bzero(filbuf,sizeof(filbuf));
             /** first thing to do is clean any allocated memory used to store values. **/
             for(i=0;i<NUMBER_RELAYS;i++)
                     free(thermometers[i].name);

    You could get a SIGABRT if you were trying to free something that was
    already freed.  Are you sure those are interlocked such that for each i you
    call strdup() exactly once, and subsequently free() exactly once?  If there
    was some code path that was breaking out of the loop or similar you might
    get such behaviour.

    Theo

    I thought double free was a SIGSEGV?

    In fact it seems fairly undefined

    It looks like it is somewhat implementation dependent. SIGSEGV means you accessed unallocated memory, but that is not the same as freeing
    allocated memory, twice.

    There seem to be instances of it reported. Google is a friend here.

    I *suspect* that if that is the problem, it's a signal from deep within libc. Whereas SIGSEGV probably emanates from a memory management unit somewhere.

    --
    "Strange as it seems, no amount of learning can cure stupidity, and
    higher education positively fortifies it."

    - Stephen Vizinczey

  • From Richard Kettlewell@3:770/3 to no@thanks.net on Fri Sep 15 16:09:18 2023
    XPost: comp.os.linux.misc

    candycanearter07 <no@thanks.net> writes:
    On 9/15/23 08:23, Theo wrote:
    You could get a SIGABRT if you were trying to free something that was
    already freed. Are you sure those are interlocked such that for each
    i you call strdup() exactly once, and subsequently free() exactly
    once? If there was some code path that was breaking out of the loop
    or similar you might get such behaviour.

    I thought double free was a SIGSEGV?

    If Glibc detects it you’ll get a diagnostic and SIGABRT.

    If it doesn’t detect it then anything could happen - SIGSEGV is just one possibility.

    --
    https://www.greenend.org.uk/rjk/

  • From vallor@3:770/3 to tnp@invalid.invalid on Fri Sep 15 15:12:24 2023
    XPost: comp.os.linux.misc

    On Fri, 15 Sep 2023 14:56:23 +0100, The Natural Philosopher <tnp@invalid.invalid> wrote in <ue1nq7$39033$1@dont-email.me>:

    On 15/09/2023 14:23, Theo wrote:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid>
    wrote:
    On 15/09/2023 12:12, Ralf Fassel wrote:
    | {
    | *q++=0;
    | thermometers[i].name=strdup(p); //
    | make a copy of the name and attach it
    | to our thermometer structure

    Memory leak if thermometers[i].name already contains something.

    further up the line...

    bzero(filbuf,sizeof(filbuf));
    /** first thing to do is clean any allocated memory used to store values. **/
    for(i=0;i<NUMBER_RELAYS;i++)
    free(thermometers[i].name);

    You could get a SIGABRT if you were trying to free something that was
    already freed. Are you sure those are interlocked such that for each i
    you call strdup() exactly once, and subsequently free() exactly once?
    If there was some code path that was breaking out of the loop or
    similar you might get such behaviour.

    Hmm. I free the pointers even for relay zones that don't have
    thermometers, whose pointers are 0. That isn't an issue.

    But that might be a remotely possible issue. I don't zero the pointers
    after freeing them as far as I can tell. The silly thing is that this
    program doesn't use the name anyway.

    It's used elsewhere. Well, I don't think it's an issue, but I can zero
    the pointers anyway after free()ing.

    Theo

    Hi, read the thread with interest.

    If you're getting SIGABRT, that's almost always the software
    calling abort(3). If you aren't, maybe there's a library calling it?

    $ man 7 signal
    [...]
    Signal Standard Action Comment
    SIGABRT P1990 Core Abort signal from abort(3)
    [but it also lists]
    SIGIOT - Core IOT trap. A synonym for SIGABRT
    _ _ _ _ _ _ _
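
    To make the abort(3) connection concrete, here is a minimal sketch (not
    taken from relayd; the function name signal_from_abort is made up for
    illustration). It forks a child that calls abort() and reports which
    signal the parent sees, which is the same information systemd logs as
    "status=6/ABRT". Glibc's heap consistency checks end up on the same path
    when they detect corruption.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that calls abort(3) and return the number of the signal
   that terminated it, or -1 if fork/wait failed or the child did not
   die from a signal. (Hypothetical helper, for illustration only.) */
static int signal_from_abort(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        abort();                 /* raises SIGABRT in the child */
        _exit(1);                /* not reached */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

    On Linux the returned value should equal SIGABRT, which is signal 6.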

    Meanwhile, if you want to avoid locking your file, you might want to write
    a fresh file with a unique name, then rename() it,
    which -- please correct me if I'm wrong -- should replace
    the desired file atomically.

    --
    -v

  • From Martin Gregorie@3:770/3 to Pancho on Fri Sep 15 15:17:14 2023
    XPost: comp.os.linux.misc

    On Fri, 15 Sep 2023 12:19:45 +0100, Pancho wrote:

    Personally, I want to run with full debug, stack trace, logging,
    exception handling, and bounds checking turned on all the time, even in production. Which is why I generally use a modern language like C# or
    Java.

    Same here. Many years back I wrote the type of debugging and programming
    support library I personally find most useful: it can report the content
    of all common variable types, dump byte arrays as both hex and ASCII,
    parse the command line, and allow the amount of debug info to be
    controlled by a command line argument.

    They are structured as small libraries that are designed to be lightweight
    enough to be left in a program when it's in general use.

    The library was originally written in C, but I soon wrote a Java version
    as well, though this hasn't been separately published yet.

    If this sounds useful, both versions can be found on
    www.libelle-systems.com in the "Free Stuff" section.



    --

    Martin | martin at
    Gregorie | gregorie dot org

  • From Rich@3:770/3 to no@thanks.net on Fri Sep 15 15:02:26 2023
    XPost: comp.os.linux.misc

    In comp.os.linux.misc candycanearter07 <no@thanks.net> wrote:

    I thought double free was a SIGSEGV?

    Check my other reply to TNP for the details, but it is "undefined" in
    C.

  • From The Natural Philosopher@3:770/3 to Richard Kettlewell on Fri Sep 15 16:37:56 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 16:09, Richard Kettlewell wrote:
    candycanearter07 <no@thanks.net> writes:
    On 9/15/23 08:23, Theo wrote:
    You could get a SIGABRT if you were trying to free something that was
    already freed. Are you sure those are interlocked such that for each
    i you call strdup() exactly once, and subsequently free() exactly
    once? If there was some code path that was breaking out of the loop
    or similar you might get such behaviour.

    I thought double free was a SIGSEGV?

    If Glibc detects it you’ll get a diagnostic and SIGABRT.

    I think that is conclusive.

    It seems to have been a double free caused by lack of defensive coding
    plus an asynch timed file write function causing the temporary creation
    of an empty file, or perhaps no file at all.




    If it doesn’t detect it then anything could happen - SIGSEGV is just one possibility.


    --
    I would rather have questions that cannot be answered...
    ...than to have answers that cannot be questioned

    Richard Feynman

  • From Rich@3:770/3 to The Natural Philosopher on Fri Sep 15 15:26:10 2023
    XPost: comp.os.linux.misc

    In comp.os.linux.misc The Natural Philosopher <tnp@invalid.invalid> wrote:
    On 15/09/2023 15:27, Ralf Fassel wrote:
    Note that the assignment

    thermometers[i].name=strdup(p);

    is *inside* the while() loop without a free().

    Probably you argue that there ever is only a single file to read in
    that dir anyway... Personally, I've been bitten by such
    assumptions, so I'd rather check once too often than hunting down
    "can't happen" bugs.

    I do think that what has happened is that a valid file name has been
    found with empty data, or no file at all, and then no strdup is done
    - but the free is, next time around.

    That should never happen of course, as the fopen/fwrite sequence
    should certainly not delete the filename, but it is entirely possible
    that the fopen *truncates* its data. At which point we can't strdup
    anything, so the next free gets a woopsie.

    Are the "files" being written to by an independent process separate
    from this reading process?

    If yes, are you doing any form of locking/synchronization to prevent
    the reading process from trying to read from a file that a writing
    process has open/truncated, but not yet written any data into?

    If no, then you may be also hitting a race condition where the stars
    align just right, the writer has just performed its fopen/truncate
    (leaving the file empty) and the kernel decides to context switch away
    to the reader at that point, before the writer can write and close the
    file. The reader will then see an empty file.

    The classic "lock free" solution to this one is for the writer to
    create and write to a temporary file, and after closing the temp file
    to rename() it to the name of the real file. Rename is documented to
    be atomic, so the reader would never see a half open, or partially
    complete, file in this case.
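
    A minimal sketch of that write-then-rename pattern, assuming POSIX and
    glibc. The function name write_file_atomically and the ".tmp.XXXXXX"
    suffix are made up for illustration; the real writer's fprintf() of
    filedata and timestamp would replace the fputs() call.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Write `data` to `path` so that a reader never sees a truncated or
   half-written file. Returns 0 on success, -1 on failure.
   (Hypothetical helper, not the thread author's code.) */
static int write_file_atomically(const char *path, const char *data)
{
    char tmp[4096];

    /* The temp file sits next to `path` so both are on the same
       filesystem; rename() cannot move files across filesystems. */
    int n = snprintf(tmp, sizeof tmp, "%s.tmp.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = mkstemp(tmp);            /* unique name, created mode 0600 */
    if (fd < 0)
        return -1;

    FILE *fp = fdopen(fd, "w");
    if (!fp) {
        close(fd);
        unlink(tmp);
        return -1;
    }

    int ok = (fputs(data, fp) != EOF);
    ok = (fflush(fp) == 0) && ok;
    ok = (fclose(fp) == 0) && ok;     /* fclose() also closes fd */

    /* Only a complete file is ever given the real name. */
    if (!ok || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

    Two caveats with this sketch: mkstemp() creates the file with mode 0600,
    so a reader running as a different user would need an fchmod() before
    the rename; and if the reader scans the directory for files, it has to
    skip the temporary names or the temp file has to live in a sibling
    directory on the same filesystem.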

  • From The Natural Philosopher@3:770/3 to Rich on Fri Sep 15 16:44:54 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 16:26, Rich wrote:
    In comp.os.linux.misc The Natural Philosopher <tnp@invalid.invalid> wrote:
    On 15/09/2023 15:27, Ralf Fassel wrote:
    Note that the assignment

    thermometers[i].name=strdup(p);

    is *inside* the while() loop without a free().

    Probably you argue that there ever is only a single file to read in
    that dir anyway... Personally, I've been bitten by such
    assumptions, so I'd rather check once too often than hunting down
    "can't happen" bugs.

    I do think that what has happened is that a valid file name has been
    found with empty data, or no file at all, and then no strdup is done
    - but the free is, next time around.

    That should never happen of course, as the fopen/fwrite sequence
    should certainly not delete the filename, but it is entirely possible
    that the fopen *truncates* its data. At which point we can't strdup
    anything, so the next free gets a woopsie.

    Are the "files" being written to by an independent process separate
    from this reading process?

    Yes

    If yes, are you doing any form of locking/synchronization to prevent
    the reading process from trying to read from a file that a writing
    process has open/truncated, but not yet written any data into?

    No.

    If no, then you may be also hitting a race condition where the stars
    align just right, the writer has just performed its fopen/truncate
    (leaving the file empty) and the kernel decides to context switch away
    to the reader at that point, before the writer can write and close the
    file. The reader will then see an empty file.

    I think that is exactly the case. I didn't think that was in fact possible.

    The classic "lock free" solution to this one is for the writer to
    create and write to a temporary file, and after closing the temp file
    to rename() it to the name of the real file. Rename is documented to
    be atomic, so the reader would never see a half open, or partially
    complete, file in this case.

    Yes, I was just wondering that before I read this post. Rename unlinks
    the old file, does it?

    I might implement that as well. It doesn't really matter, however, as
    in practice the structures that contain thermometer data don't get
    altered if no valid data is found, so the lack of a proper file, instead
    of causing a crash, now simply means the (unused in this program) name
    data gets erased for a few seconds. It simply misses a reading and uses
    last time's data for everything else, mostly the temperature.




    --
    Truth welcomes investigation because truth knows investigation will lead
    to converts. It is deception that uses all the other techniques.

  • From The Natural Philosopher@3:770/3 to vallor on Fri Sep 15 16:46:42 2023
    XPost: comp.os.linux.misc

    On 15/09/2023 16:12, vallor wrote:
    On Fri, 15 Sep 2023 14:56:23 +0100, The Natural Philosopher <tnp@invalid.invalid> wrote in <ue1nq7$39033$1@dont-email.me>:

    On 15/09/2023 14:23, Theo wrote:
    In comp.sys.raspberry-pi The Natural Philosopher <tnp@invalid.invalid>
    wrote:
    On 15/09/2023 12:12, Ralf Fassel wrote:
    | {
    | *q++=0;
    | thermometers[i].name=strdup(p); //
    | make a copy of the name and attach it
    | to our thermometer structure

    Memory leak if thermometers[i].name already contains something.

    further up the line...

    bzero(filbuf,sizeof(filbuf));
    /** first thing to do is clean any allocated memory used to
    store values. **/
    for(i=0;i<NUMBER_RELAYS;i++)
    free(thermometers[i].name);

    You could get a SIGABRT if you were trying to free something that was
    already freed. Are you sure those are interlocked such that for each i
    you call strdup() exactly once, and subsequently free() exactly once?
    If there was some code path that was breaking out of the loop or
    similar you might get such behaviour.

    Hmm. I free the pointers even for relay zones that don't have
    thermometers, whose pointers are 0. That isn't an issue.

    But that might be a remotely possible issue. I don't zero the pointers
    after freeing them as far as I can tell. The silly thing is that this
    program doesn't use the name anyway.

    It's used elsewhere. Well, I don't think it's an issue, but I can zero
    the pointers anyway after free()ing.

    Theo

    Hi, read the thread with interest.

    If you're getting SIGABRT, that's almost always the software
    calling abort(3). If you aren't, maybe there's a library calling it?

    $ man 7 signal
    [...]
    Signal Standard Action Comment
    SIGABRT P1990 Core Abort signal from abort(3)
    [but it also lists]
    SIGIOT - Core IOT trap. A synonym for SIGABRT
    _ _ _ _ _ _ _

    Meanwhile, if you want to avoid locking your file, you might want to write
    a fresh file with a unique name, then rename() it,
    which -- please correct me if I'm wrong -- should replace
    the desired file atomically.


    I think the consensus is that it does.

    Presumably if the read process has the old file open, that will be valid
    until it closes it?


    --
    "I guess a rattlesnake ain't risponsible fer bein' a rattlesnake, but ah
    puts mah heel on um jess the same if'n I catches him around mah chillun".

  • From Ralf Fassel@3:770/3 to All on Fri Sep 15 18:13:44 2023
    XPost: comp.os.linux.misc

    * The Natural Philosopher <tnp@invalid.invalid>
    | On 15/09/2023 15:27, Ralf Fassel wrote:
    | > * The Natural Philosopher <tnp@invalid.invalid>
    | > | > | thermometers[i].name=strdup(p); //
    | > | > | make a copy of the name and attach it
    | > | > | to our thermometer structure
    | > | > Memory leak if thermometers[i].name already contains something.
    | > | >
    | > | further up the line...
    | >>
    | > | bzero(filbuf,sizeof(filbuf));
    | > | /** first thing to do is clean any allocated memory used to
    | > | store values. **/
    | > | for(i=0;i<NUMBER_RELAYS;i++)
    | > | free(thermometers[i].name);
    | > Note that the assignment
    | > thermometers[i].name=strdup(p);
    | > is *inside* the while() loop without a free().
    | > Probably you argue that there ever is only a single file to read in
    | > that dir anyway... Personally, I've been bitten by such assumptions, so I'd
    | > rather check once too often than hunting down "can't happen" bugs.
    | > R'
    | >
    | No. you have misunderstood how the code works.

    Sorry, but I have to give that compliment back. You describe how the
    code is _intended_ to work. I described how the code _actually_ works.

    It all depends on what files with which content are there in that
    directory, so if there ever is only one file per ZONE, all is peachy.
    If not, all bets are off.

    Not 100% seriously, may I refer you to
    https://core.tcl-lang.org/tips/doc/trunk/tip/131.md
    ;-)

    | (It would be trivial to simply add a conditional that only strdups to
    | a pointer if it is NULL).

    With char* malloc'd pointers, I find it much easier to simply stick to
    the pattern:
    - initialize to 0
    - free before reassignment
    - assign to 0 after free when not directly reassigning
    instead of arguing at each place why not sticking to the pattern is not
    a problem.
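
    A minimal sketch of that three-rule pattern, assuming a cut-down
    thermometer structure. Only thermometers, name and NUMBER_RELAYS come
    from the posted code; the value 4 and the helper names clear_name and
    set_name are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

#define NUMBER_RELAYS 4    /* assumed value; the real count is not shown */

struct thermometer {
    char *name;            /* owned heap string, or NULL */
};

/* Rule 1: static storage is zero-initialised, so every name starts NULL. */
static struct thermometer thermometers[NUMBER_RELAYS];

/* Rule 3: assign 0 after free, so freeing twice is harmless. */
static void clear_name(struct thermometer *t)
{
    free(t->name);         /* free(NULL) is defined to do nothing */
    t->name = NULL;
}

/* Rule 2: free before reassignment. The copy is made first so the call
   is safe even if new_name points into the old string.
   Returns 0 on success, -1 if strdup() fails (name is then NULL). */
static int set_name(struct thermometer *t, const char *new_name)
{
    char *copy = strdup(new_name);
    clear_name(t);
    t->name = copy;
    return copy ? 0 : -1;
}
```

    With these two helpers the clean-up loop can call clear_name() on every
    entry on every pass, whether or not a file was found the previous time
    around, and the parsing loop can call set_name() any number of times
    per index without leaking.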

    | However they are not at this time misconfigured, so it shouldn't be
    | the crash problem, [...]

    Agreed.

    | I do think that what has happened is that a valid file name has been
    | found with empty data, or no file at all, and then no strdup is done -
    | but the free is, next time around.

    Easy to verify via diagnostics, just add a stderr-output for every
    unexpected situation (such as the same index seen twice etc).

    | As is allocating memory only if the pointers are null.

    Why not simply free()/strdup()? If you assign to 0 only, you may get
    old contents for the new file inside the loop (can't happen, I know :-)!

    R'

  • From Ralf Fassel@3:770/3 to All on Fri Sep 15 18:19:12 2023
    XPost: comp.os.linux.misc

    * The Natural Philosopher <tnp@invalid.invalid>
    | On 15/09/2023 16:12, vallor wrote:
    | > Meanwhile, if you want to avoid locking your file, you might want to
    | > write
    | > a fresh file with a unique name, then rename() it,
    | > which -- please correct me if I'm wrong -- should replace
    | > the desired file atomically.

    | I think the consensus is that it does.

    | Presumably if the read process has the old file open, that will be
    | valid until it closes it?

    On Linux: yes. Once a process has a file open, it sees the 'old'
    contents if the file is removed from disk.

    https://stackoverflow.com/questions/2028874/what-happens-to-an-open-file-handle-on-linux-if-the-pointed-file-gets-moved-or-d
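
    This is easy to check with a small test. The sketch below assumes Linux
    semantics; the function name and both path arguments are made up for
    illustration. It opens a file for reading, replaces it via rename(),
    and then reads through the handle that was opened before the
    replacement.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if a reader that opened `path` before it was replaced by
   rename(tmp, path) still reads the old contents while a fresh open
   sees the new ones; 0 if not; -1 on error.
   (Hypothetical helper, for illustration only.) */
static int old_contents_survive_rename(const char *path, const char *tmp)
{
    FILE *w = fopen(path, "w");
    if (!w) return -1;
    fputs("old\n", w);
    fclose(w);

    FILE *reader = fopen(path, "r");   /* holds the old file open */
    if (!reader) return -1;

    w = fopen(tmp, "w");
    if (!w) { fclose(reader); return -1; }
    fputs("new\n", w);
    fclose(w);

    if (rename(tmp, path) != 0) { fclose(reader); return -1; }

    char old_buf[16] = "";
    char new_buf[16] = "";
    char *got_old = fgets(old_buf, sizeof old_buf, reader);
    fclose(reader);                    /* old file is released here */

    FILE *fresh = fopen(path, "r");    /* opens the replacement */
    char *got_new = fresh ? fgets(new_buf, sizeof new_buf, fresh) : NULL;
    if (fresh) fclose(fresh);
    remove(path);

    return got_old && got_new
        && strcmp(old_buf, "old\n") == 0
        && strcmp(new_buf, "new\n") == 0;
}
```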

    R'

  • From vallor@3:770/3 to Ralf Fassel on Fri Sep 15 16:28:02 2023
    XPost: comp.os.linux.misc

    On Fri, 15 Sep 2023 18:19:13 +0200, Ralf Fassel <ralfixx@gmx.de> wrote in <ygav8cbh0ji.fsf@akutech.de>:

    * The Natural Philosopher <tnp@invalid.invalid>
    | On 15/09/2023 16:12, vallor wrote:
    | > Meanwhile, if you want to avoid locking your file, you might want to
    | > write
    | > a fresh file with a unique name, then rename() it,
    | > which -- please correct me if I'm wrong -- should replace
    | > the desired file atomically.

    | I think the consensus is that it does.

    | Presumably if the read process has the old file open, that will be
    | valid until it closes it?

    On Linux: yes. Once a process has a file open, it sees the 'old'
    contents if the file is removed from disk.

    https://stackoverflow.com/questions/2028874/what-happens-to-an-open-file-handle-on-linux-if-the-pointed-file-gets-moved-or-d

    R'

    Speaking of which: back in the days of Linux yore, you
    could retrieve the contents of a deleted file if a
    process still had it open through: /proc/##/fd/*.

    (Nowadays, those are symlinks.)

    --
    -v

  • From vallor@3:770/3 to tnp@invalid.invalid on Fri Sep 15 16:21:58 2023
    XPost: comp.os.linux.misc

    On Fri, 15 Sep 2023 16:46:43 +0100, The Natural Philosopher <tnp@invalid.invalid> wrote in <ue1u93$3a7pg$3@dont-email.me>:

    On 15/09/2023 16:12, vallor wrote:
    On Fri, 15 Sep 2023 14:56:23 +0100, The Natural Philosopher
    <tnp@invalid.invalid> wrote in <ue1nq7$39033$1@dont-email.me>:

    On 15/09/2023 14:23, Theo wrote:
    In comp.sys.raspberry-pi The Natural Philosopher
    <tnp@invalid.invalid> wrote:
    On 15/09/2023 12:12, Ralf Fassel wrote:
    | {
    | *q++=0;
    | thermometers[i].name=strdup(p); //
    | make a copy of the name and attach it
    | to our thermometer structure

    Memory leak if thermometers[i].name already contains something.

    further up the line...

    bzero(filbuf,sizeof(filbuf));
    /** first thing to do is clean any allocated memory used to
    store values. **/
    for(i=0;i<NUMBER_RELAYS;i++)
    free(thermometers[i].name);

    You could get a SIGABRT if you were trying to free something that was
    already freed. Are you sure those are interlocked such that for each
    i you call strdup() exactly once, and subsequently free() exactly
    once? If there was some code path that was breaking out of the loop
    or similar you might get such behaviour.

    Hmm. I free the pointers even for relay zones that don't have
    thermometers, whose pointers are 0. That isn't an issue.

    But that might be a remotely possible issue. I don't zero the pointers
    after freeing them as far as I can tell. The silly thing is that this
    program doesn't use the name anyway.

    It's used elsewhere. Well, I don't think it's an issue, but I can zero
    the pointers anyway after free()ing.

    Theo

    Hi, read the thread with interest.

    If you're getting SIGABRT, that's almost always the software calling
    abort(3). If you aren't, maybe there's a library calling it?

    $ man 7 signal
    [...]
    Signal Standard Action Comment
    SIGABRT P1990 Core Abort signal from abort(3)
    [but it also lists]
    SIGIOT - Core IOT trap. A synonym for SIGABRT
    _ _ _ _ _ _ _

    Meanwhile, if you want to avoid locking your file, you might want to
    write a fresh file with a unique name, then rename() it,
    which -- please correct me if I'm wrong -- should replace the desired
    file atomically.


    I think the consensus is that it does.

    Presumably if the read process has the old file open, that will be valid until it closes it?

    Yes -- and the old file remains allocated on disk until
    its file descriptor is closed.

    --
    -v

  • From Rich@3:770/3 to The Natural Philosopher on Fri Sep 15 18:27:20 2023
    XPost: comp.os.linux.misc

    In comp.os.linux.misc The Natural Philosopher <tnp@invalid.invalid> wrote:
    On 15/09/2023 16:26, Rich wrote:
    Are the "files" being written to by an independent process separate
    from this reading process?

    Yes

    If yes, are you doing any form of locking/synchronization to prevent
    the reading process from trying to read from a file that a writing
    process has open/truncated, but not yet written any data into?

    No.

    If no, then you may be also hitting a race condition where the stars
    align just right, the writer has just performed its fopen/truncate
    (leaving the file empty) and the kernel decides to context switch
    away to the reader at that point, before the writer can write and
    close the file. The reader will then see an empty file.

    I think that is exactly the case. I didn't think that was in fact
    possible.

    It is. One of the points where Linux evaluates to determine if it should
    task switch is upon exit from a syscall. If your writer process runs
    out its timeslice during the in-kernel portion of the work for an
    fopen, then the kernel will suspend it and schedule another process to
    run. You now have an empty, unwritten file on disk which will not be
    written to until the writer is next scheduled by the kernel. If the
    next process scheduled is the reader, and it was last suspended just
    before it did an fopen() on this same file, it will now fopen() an
    empty file.
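
    The window can be reproduced deterministically in a single process by
    playing both roles in order. A minimal sketch, with the function name
    and sample readings made up for illustration: the "writer" reopens the
    file with "w", and the "reader" reads it before the writer has written
    anything.

```c
#include <stdio.h>

/* Returns how many bytes a reader gets when it opens `path` after the
   writer has called fopen(path, "w") but before the writer has written
   anything; -1 on error. (Hypothetical helper, for illustration only.) */
static long bytes_seen_during_rewrite(const char *path)
{
    char buf[256];

    FILE *w = fopen(path, "w");
    if (!w) return -1;
    fputs("21.5\n", w);
    fclose(w);                     /* file now holds a complete reading */

    w = fopen(path, "w");          /* "w" truncates at open time */
    if (!w) return -1;

    FILE *r = fopen(path, "r");    /* the reader arrives in the window */
    if (!r) { fclose(w); return -1; }
    size_t len = fread(buf, 1, sizeof buf - 1, r);
    fclose(r);

    fputs("22.0\n", w);            /* the writer finally gets to run */
    fclose(w);
    remove(path);
    return (long)len;
}
```

    The reader gets zero bytes here even though the file held a complete
    reading a moment earlier and will hold one again a moment later.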

    The classic "lock free" solution to this one is for the writer to
    create and write to a temporary file, and after closing the temp file
    to rename() it to the name of the real file. Rename is documented to
    be atomic, so the reader would never see a half open, or partially
    complete, file in this case.

    Yes, I was just wondering that before I read this post. Rename unlinks
    the old file, does it?

    Yes: (man 2 rename):

    If newpath already exists, it will be atomically replaced, so that
    there is no point at which another process attempting to access
    newpath will find it missing. However, there will probably be a
    window in which both oldpath and newpath refer to the file being
    renamed.

    I might implement that as well. It doesn't really matter, however,
    as in practice the structures that contain thermometer data don't get
    altered if no valid data is found, so the lack of a proper file,
    instead of causing a crash, now simply means the (unused in this
    program) name data gets erased for a few seconds. It simply misses a
    reading and uses last time's data for everything else, mostly the
    temperature.

    Yes, your temperature monitoring was unaffected. But if the race was
    sometimes triggering the pointer double-free that your loop previously
    had, then the lack of atomicity was at least one trigger for the
    intermittent crash.

    So seems like two routes to fix:

    1) remove the conditions that can cause a double-free to occur in the
    code (seems like you've already done this from other posts)

    2) use rename() to move newly written files into place for the reader,
    so the reader never opens an empty file (exclusive of the writer
    crashing before it wrote anything to the file).

    For something that you'll potentially want to 'just run' for
    months/years on end without daily care and feeding, doing both is the
    better defense.
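
    For the reader side of route 1, a minimal sketch of a defensive read,
    assuming the files are small single-line readings as in the original
    255-byte fread(). The function name read_small_file is made up for
    illustration. It checks the fopen() result, treats an empty file as
    "no reading this time", and leaves the caller's buffer untouched in
    both cases so the previous value can be kept.

```c
#include <stdio.h>
#include <string.h>

/* Read up to bufsize-1 bytes (at most 256) of `path` into buf and
   NUL-terminate it. Returns the number of bytes read, or -1 if the file
   is missing or empty, in which case buf is left unchanged.
   (Hypothetical helper, not the thread author's code.) */
static long read_small_file(const char *path, char *buf, size_t bufsize)
{
    char tmp[256];

    if (bufsize == 0)
        return -1;

    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;                    /* no file at all */

    size_t want = bufsize - 1 < sizeof tmp ? bufsize - 1 : sizeof tmp;
    size_t len = fread(tmp, 1, want, fp);
    fclose(fp);

    if (len == 0)
        return -1;                    /* empty or unreadable: skip */

    memcpy(buf, tmp, len);            /* commit only a non-empty read */
    buf[len] = '\0';
    return (long)len;
}
```

    A caller would only free and replace its stored strings when the return
    value is positive, so a missed reading never leaves a stale pointer
    behind for the next clean-up pass.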
