We have a product built from the t89c51cc01 or 03 chips using C51. Timer 0 does a 10mS interrupt to check a few hardware inputs. Timer 1 runs the serial port. Timer 2 provides an adjustable clock output around 40 KHz. CAN is initialized, but during most of our testing no messages are being transmitted or received.
The program spends most of its time in a measurement loop that can be set anywhere from 20 mS to 1400 mS per iteration, with some of the time extension taken by added measurements and some by timer delays. If the user presses 'm' on a serial terminal, the loop stops while a configuration menu runs.
C51 printf causes us more problems than any other software aspect of this product. Trying to guess what will happen if we change a printf statement, or change something else near a printf statement, seems an exercise in superstition as much as logic.
When I refer to printf, I mean to include sprintf as well. I had hopes that doing an sprintf and then sending the string out with putchar would solve some of our printf problems, but so far the sprintf version has always shown the same failure as the printf version.
I've searched the web and the Keil site, and while <http://www.keil.com/support/docs/867.htm> explains some of our issues, it is far from explaining the spooky ones:
----- ... variable length argument lists must be limited to some predetermined size. By default, we limit that to 15 bytes of arguments for memory models that use DATA memory. The limit is 40 bytes for memory models using XDATA. The MAXARGS compiler directive lets you change the default limit to whatever you desire. Note that since the limit [o]f DATA space is 128 bytes (including register banks and the bit area) you can't just set MAXARGS to 128. -----
We've learned to live with a maximum of 7 ints, 5 string pointers, or 3 long ints or floats per printf. What made this initially confusing is that using more doesn't always fail. We've had printfs that had worked for years suddenly start crashing the entire program because we changed the timing of the loop, or changed a different nearby printf. It also seems like as long as printf prints at least one of its values incorrectly, it doesn't crash. If you manage to get that incorrect display problem fixed, it is very likely the "fixed" printf will crash within ten or maybe a hundred iterations.
The output of printf is apparently "spooled" somehow. When we speed our loop up to 20 mS per iteration, we see characters output at the serial port almost continuously, though we know the actual printf calls complete in very tiny fractions of that total time. Recently we saw a case where this "spooling" repeatably failed:
We added a new prompt in our operator menu: printf("\n 'U' moves +/right, 'D' -/left, 'Q' accepts: "); Even written as: printf("\n 'U' moves +/right, 'D' -/left, "); printf("'Q' accepts: "); it would always be truncated exactly after the 'Q'.
After the function including the above printf exited, its calling function called two more printfs:
ANSI_cls(); // #define ANSI_cls() printf("\x1B[2J") printf("\n --- RUNNING...");
Both of those, which had previously worked fine for years, vanished from the serial output when we added the new prompt.
When that calling function exited, the main loop resumed, and its first printf did still appear properly:
printf("\n\n\nTime(ms)%cSmpl%cCnt%cNDVI%cselected_VI%c", s1, s1, s1, s1, s1);
Adding a delay after the first printf:
printf("\n 'U' moves +/right, 'D' -/left, 'Q' accepts: "); TmrDelay(10);
seemed to solve (or hide) this problem. Sometimes this helps and sometimes it doesn't, and we haven't a clue why.
Sometimes there seems to be a problem with passing the results of calculations to printf. This worked perfectly...
printf("\r%02d <%+05.1f %+05.1f> ", (NDVIsmoothed_16s+5)/10, (FP)(NDVIvalley_16s-target_nom_16s)/10.0, (FP)(NDVIpeak_16s-target_nom_16s)/10.0);
until the result of the third calculation got bigger than about 40000, which would crash the program almost instantly.
By the time it worked for all values, it looked like this:
minF = (FP)(NDVIvalley_16s-target_nom_16s)/10.0; maxF = (FP)(NDVIpeak_16s-target_nom_16s)/10.0; printf("\r% 03.0f <", NDVIsmoothed*100); // this is an FP of NDVIsmoothed_16s printf("%+05.1f ", minF); printf("%+05.1f> ", maxF);
Sometimes no amount of separating or dividing will make a certain printf work:
NDVIdev_16s = (NDVIpeak_16s-NDVIvalley_16s)/2; printf("\n%03d ±%03d ", NDVIsmoothed_16s, NDVIdev_16s);
That crashes with loop times under 400mS, even with the calculation separate.
Same with this sequence:
printf("\n%03d ", NDVIsmoothed_16s); printf("{-%3d +%3d} ", minus16s, plus16s);
Both of those would crash, almost always after printing a single '-' on the new line, within maybe 100 iterations - if the loop time was under 400 mS.
This works in exactly that same spot:
printf("\n% 05.1f ", (FP)NDVIsmoothed_16s/10.0);
Then there is the long series of printfs that form the normal output of the product, that have remained unchanged for many years, and that typically repeat every 100 mS whenever it is on. With a few particular recent hardware instances, if we set the loop time to exactly 50 mS, the program will crash after several minutes of operation. Choose any other loop time and the same device will run forever. Probably 99% of our systems run fine at any speed. We don't know for sure this is caused by printf, but the resemblance seems suggestive.
There must be much more to this than just the variable length argument limit. Could we have an in-depth explanation of the other limitations of C51 printf?
Have you made sure that you don't try to generate more output than the serial port can handle?
Think both about the sustained transfer speed of the serial port, and the max number of characters that may be queued.
printf() will have to use internal buffers when formatting some of it's output - numerical values are normally formatted in reverse order. When emitting it's data, printf will also have to either fail with truncated output, or busy-loop, if the serial port buffer is full.
sprintf() assumes (without being able to check) that the buffer you supply is big enough. After that, it is up to you to decide how to send the formatted data to the serial port, i.e. to decide what to do if the transmit buffer gets full.
When the runtime library supports it, snprintf is a lot safer to use than sprintf, since it takes an extra parameter specifying the size of the target buffer.
Our product has a CLI built on printf(). While it generally doesn't do continuous output, we see nothing like these sorts of problems. The problem likely lies a bit deeper than "blame the buggy library".
When you say "crash", what symptoms do you mean exactly? Garbled output on the serial port? The program stops running entirely? If so, how do you know it's not running? Lack of further output/activity, or do you have an emulator or monitor where you can break into the code and see where it is executing?
printf() is not reentrant. You must be sure that there is only one context that can call into printf, and that it always completes before the next entry into the function. That is, you can't call printf() safely from your main loop and interrupt handlers, or from several tasks in your RTOS if you have one.
The output of printf is apparently "spooled" somehow.
printf() outputs through putchar(). You can take control of the output buffering by replacing the implementation of this function. See putchar.c in your lib directory. Our serial driver has a 256-byte ring buffer in both rx and tx directions. This is good enough for a burst of a few lines of output.
Note that the sample implementation has some infinite loops on the interrupt bits or flow control state. Perhaps your program crash is the serial driver getting hung waiting for something that won't happen.
When we speed our loop up to 20 mS per iteration, we see characters output at the serial port almost continuously sustained transfer speed of the serial port
RS232 aysnc is 10 bits per character; 9600 baud is 960 characters per second or about 1ms per character. It's extremely easy to overrun the output speed. For example, in
printf("\n 'U' moves +/right, 'D' -/left, 'Q' accepts: ");
I count 46 characters output, 35 until just after the Q. (One more if your newline is CRLF instead of just one character.) Execute that statement once every 20 ms, and you have to have a serial rate of 46 * 1000 / 20 ms * 10 bits/char = 23,000 bps to keep up. That probably means 38400 on the usual serial port. If the rate is set to 19.2k, then you'll drop characters (interestingly enough about where the Q is; 35 * 50 * 10 = 17500 bps).
Or, you have to design your serial driver (putchar) to busy-wait on buffer space, which means your main loop calling printf() will be limited to the speed of the actual serial output.
Besides printf parameter size buffer, there are no other magic limitations of printf.
So your problem must be slightly different. Do you get any linker warnings (or did you disable any linker warnings)?
! Good to see so many dedicated and helpful people here!
I'll try to deal with all the questions and suggestions so far:
The serial port runs at 38400, and our worst case diagnostic output at maximum loop speed looks like this:
235800, 4, 2012, 0.860, 0.075, 0.042, 0.561, 2560, 16672, 43552, 53920 235830, 4, 2013, 0.855, 0.078, 0.042, 0.544, 2560, 16192, 43456, 53792 235860, 4, 2014, 0.850, 0.081, 0.045, 0.556, 2624, 16544, 43552, 53920 235890, 4, 2015, 0.857, 0.077, 0.042, 0.551, 2560, 16384, 43424, 53824
The first column is milliseconds, so we're averaging 30 mS per loop. The serial port is busy for 24 of those 30 mS. (With normal 34 char wide output this "20 mS" setting actually does 20 mS.) Except for the few hardware examples that show the "50 mS loop time crash", any product can do this all day. The "50 mS" problem units can do it all day at 40 or 30 mS, and at 50 mS they will crash with only 34 characters per line as well as with the 80 char lines.
printf() will have to use internal buffers when formatting some of it's output - numerical values are normally formatted in reverse order.
I have the Plauger book with the standard library code, but I'm far from understanding the details of _Putfld and _Litob. I think I see where it converts numbers from lsb upward... But it looks like _Printf reads the format string left-to-right. Of course the Keil library may be different. I would think that since the number of bytes we can request to be converted is strictly limited to MAXARGS, it should be possible to guarantee that all subsequent buffers are sufficient to do the conversions.
When emitting it's data, printf will also have to either fail with truncated output, or busy-loop, if the serial port buffer is full.
Many of our printf issues happen in menu mode, where the human is by far the slowest factor in the loop. The menu does occasionally print more than 80 consecutive characters without the timer enforced 6+ mS pause, but I believe printf problems are as likely to appear with a single line of output as with multiple line pages. The one truncation example in my earlier post is unusual; we are much more likely to see wrong values or a complete program crash than truncation.
I had high hopes, and made sure my buffere were big enough, and of course I haven't tried an sprintf version of every problem line, but the ones I have tried have showed the same issues as printf.
-----
Crash means the program stops running. There are several LEDs that flash in synch with different processes, and they all stop. A power cycle is required, as there is no hardware reset button to push. I've also looked at the processor with a scope, and nothing is moving except the Timer 2 clock out, and the ALE pin which is just tied high.
No RTOS, and none of our interrupts use printf. I learned about "not reentrant" long ago on another project in another development system, where output from interrupt routines was required. Ended up queing all output for a single interrupt handler to print.
But "You must be sure that there is only one context that can call into printf, and that it always completes before the next entry into the function" might still be relevant. When is printf complete? When it returns control to the next line of code? Sometimes adding a TimerDelay after a printf call seems to solve (or hide) a problem... Could that mean printf returns before it is truly "complete"?
In my "spooling" example, it seemed like there was some interaction between printf returning and its calling function returning. If the function that called printf returns before printf is truly "complete", and the returned-to function immediately calls another printf, is that two simultaneous contexts?
printf() outputs through putchar().
That's a Keil adaptation, right?
... to be continued...
... continued...
You can take control of the output buffering by replacing the implementation of this function. See putchar.c in your lib directory. Our serial driver has a 256-byte ring buffer in both rx and tx directions. This is good enough for a burst of a few lines of output.
I see this version. Looks like very simple I/O with XON/XOFF added.
But... I also see this version in our serial.c:
/******************************************************************************/ /* */ /* SERIAL.C: Interrupt Controlled Serial Interface for RTX-51 tiny */ /* */ /******************************************************************************/ ... /******************************************************************************/ /* putbuf: write a character to SBUF or transmission buffer */ /******************************************************************************/ void putbuf(char c) { if(!sendfull) { /* transmit only if buffer not full */ if(!sendactive && !sendstop) { /* if transmitter not active: */ sendactive = 1; /* transfer the first character direct */ SBUF = c; /* to SBUF to start transmission */ } else { /* otherwize: */ outbuf[oend++ & (OLEN - 1)] = c; /* transfer char to transmission buffer */ if(((oend ^ ostart) & (OLEN - 1)) == 0) sendfull = 1; } /* set flag if buffer is full */ } } /******************************************************************************/ /* putchar: interrupt controlled putchar function */ /******************************************************************************/ char putchar(char c) { if(c == '\n') { /* expand new line character: */ while(sendfull) { /* wait for transmission buffer empty */ // otask = os_running_task_id (); /* set output task number */ // os_wait (K_SIG, 0, 0); /* RTX-51 call: wait for signal */ otask = 0xff; /* clear output task number */ } putbuf(0x0D); /* send CR before LF for <new line> */ } while(sendfull) { /* wait for transmission buffer empty */ // otask = os_running_task_id (); /* set output task number */ // os_wait (K_SIG, 0, 0); /* RTX-51 call: wait for signal */ otask = 0xff; /* clear output task number */ } putbuf(c); /* send character */ return (c); /* return character: ANSI requirement */ }
I inherited the original code for this product, and haven't paid any attention to some parts of it. Looking at that header about RTX-51 makes me wonder if we should be using this serial.c. I find it in \Program Files\Keil\C51\RtxTiny2\Examples, but not in \lib. It does seem to work, most of the time, and it looks like the os_wait() has been commented out so it ends up doing pretty much what the default putchar does, right? I guess it is needlessly setting "otask" over and over, though...
I turned off the ability to enter XOFF in SerialISR() long ago, and our device does normally ignore Ctrl+S. Looks to me like the output buffer is tested for overflow so we shouldn't crash the whole program even if we did somehow get serial out turned off.
That probably means 38400 on the usual serial port. If the rate is set to 19.2k, then you'll drop characters (interestingly enough about where the Q is; 35 * 50 * 10 = 17500 bps).
Except for that one 'Q' example, we don't drop characters. We either get the correct number of wrong characters, or the program dies.
Do you get any linker warnings (or did you disable any linker warnings)?
No linker warnings. I won't even bother loading a file that came with warnings. Options for C51 says "Warninglevel 2" which looks like it means all on. Options for "BL51 Misc" "Disable Warning Numbers:" is blank. Is there anywhere else they could be turned off?
So unless there is something about adopting that "RTX-51" putchar (and trying to remove the RT part), that is triggering our mysteries, I'm still mystified!
Loren
Note that since sendchar() is busy-looping while waiting for space to insert the next character, it may loop forever if you have a time glitch somewhere resulting in the serial transmit interrupt not being processed.
Suggestions. Since it is not printf it self, what could it be.
1) The stack? is it overflowing. 2)not reentrant means printf can not be called from another task until it is finished. Uses semiphores if needed. 3) passing bad data to printf cuases undefined behavior. Random crashes are possible. Insure all data is possible. 4) C51 Updates, But code may work under old compilers. Then fail after an update. 5) You Ring buffer code does not work right.
An emulator maybe hand to have a this point.
The timing says #2 is most likley. with a posiblity of #5
I don't find any "sendchar". Do you mean putchar()? I see that it blocks on sendfull, which means putbuf() has decided the 32-char output buffer is full. And I see that interrupt "TI" is required in order to make space and clear sendfull. Our sendstop/XOFF is disabled, the TI interrupt appears to come from hardware, and we shouldn't have any higher priority interrupts interfering with it, so I don't see how anything our program does could cause us to occasionally loop forever here.
The minimum standard putchar() in lib also blocks if the TI interrupt isn't received:
/* * putchar (mini version): outputs charcter only */ char putchar (char c) { while (!TI); TI = 0; return (SBUF = c); }
So if not receiving TI is our problem, I don't see how the standard putchar() would help us, and the serial.c version of putchar() must not be hurting us.
If the current serial buffer is only OLEN = 32 characters, that probably explains why when we print 36 chars per measurement the port is only busy for about 14 of 20 mS, but when we print 80 characters (with the same load of other tasks) they take 24 of 30 mS to appear, with the serial port active for all of the additional time. Looks like the program does block and wait for the characters to be output. I'll have to experiment with a larger output buffer for possible speed increases, but I don't think any of this explains our crashes.
Since it is not printf it self, what could it be.
1. I have no idea what our stack does. How does one tell, short of running with a hardware debug connection (which we're not set up to do)?
2. We aren't using the RTOS, and we aren't calling printf from any interrupts that we know of. I still worry that some part of printf is still active and vulnerable even after it has returned control to the next program statement. How else to explain why adding a timer delay after a printf suddenly makes it work?
3. Sometimes we have been guilty of passing the wrong variable sizes or types for our format strings, but the batch of issues that prompted my original post here was checked very carefully for such things.
4. We rarely load updates. We've been using the same compiler for literally years, since none of the reported fixes appeared to affect us.
5. Ring buffer referring to the subject of my post just before this one, right? As I said there, it seems to be working fine under normal circumstances. I don't see how the particular configuration of a printf that is feeding it characters should suddenly make it fail.
I have no experience with hardware debugging or emulation on this system, and have been enough underwhelmed by other microcontroller versions that I've never explored what's available. Is there something we should look at?
I agree #2 seems most likely. In the recent batch of problems, the printfs were interspersed with directly called putchars. But how could that lead to reentrancy issues, unless like I fear, some part of printf isn't really "finished" when it passes control to the next putchar statement?
Today I discovered an overrun buffer in one of the putchar sequences adjacent to one of the printf issues in my original post. After the printfs shown in the two following versions - the first of which failed if the result of the third calculation got bigger than about 40000:
and the second of which worked even with the overrun buffer following it:
there was code to configure and print a string:
if ((g_idx >= 0) && (g_idx <= 28)) g_ctr[g_idx] = '>'; // insert more chars at various positions in the buffer... for (i = 0; i <= 28; i++) { putchar(g_ctr[i]); }
The buffer "g_ctr[]" was declared too small for the number of locations we referred to in configuring and printing it. In every case we wrote '-' characters to the same excess length in the buffer, but when the result of the third calculation got bigger than about 40000, we began writing other characters, like '<', '^', or '>', past the declared space.
I don't have the ability to go back to the map files for the failed versions, but the buffer was always in plain data memory, not xdata, and in the version that worked following the second printf form shown above, it was located here:
D:0007H SYMBOL g_idx D:0029H SYMBOL g_ctr
I've gotten pretty good at understanding maps of variables in xdata, but I'm not aware of the fine points of regular 'D' memory. It looks like almost every routine that has any 'D' variables has them at 0007H and 0029H, or in that area, and nobody has any beyond 004FH. Then there comes:
------- MODULE PRINTF D:0050H PUBLIC ?_PRINTF?BYTE D:0050H PUBLIC ?_SPRINTF?BYTE C:7F7AH PUBLIC _PRINTF C:7F74H PUBLIC _SPRINTF ------- ENDMOD PRINTF
At one point the g_ctr buffer was getting characters written from 0 out to 32 (instead of just 28), which would run from 0x29 to 0x50, just exactly enough to step on the "?_PRINTF?BYTE", whatever that is. If the buffer was actually located there in the problem versions, that would be highly suspicious...
Except if the "?_PRINTF?BYTE" is critical to printf, how could overwriting it with '-' hundreds of times be acceptable, but overwriting it with '>' not be?
How could the second form of printf survive having its "?_PRINTF?BYTE" overwritten when the first form clearly couldn't?
And how could overwriting the buffer _after_ the printf had returned control to the following code spoil the printf that should already have been complete? Maybe it killed the _next_ printf? In every case where it crashed, it printed the initial '\r' of the printf, and then a single '-' character. None of the printfs should have printed a '-' there, but the following putchar sequence almost always should have printed one right after the printf returned.
What is the "?_PRINTF?BYTE", anyway? Google took me directly to <http://www.keil.com/support/man/docs/bl51/bl51_ln_mapfile.htm> which shows the item but doesn't explain it to me.
Does this line of exploration possibly explain at least this one printf mystery? It can't be a general explanation, because we hardly have any plain Data variables; almost everything is in xdata.
If you've got a buffer overrun, then you're really into "undefined behaviour" territory - so anything could happen!
Especially with Data Overlaying!