| Details | Message |
|---|---|
Read-Only Author tamir michael Posted 25-Jun-2009 20:17 GMT Toolset ARM |  Are you using RTX? Can you help me? tamir michael Hello, I have a major problem with RTX, and Keil don't seem to be able to help (they want a simple scenario that reproduces the problem, but I cannot give them the hardware, of course; maybe I can make it go wrong on an evaluation board). I'm using RTX as the backbone of a product that needs to run for extended periods of time without a reboot (weeks...). The problem is that RTX stops executing arbitrary tasks at arbitrary moments - they remain 'ready' but do not get serviced. Today I discovered a task entering 'WAIT_MUT' while not using ANY mutex. My question: Are there any tips for using RTX correctly? I am growing totally frustrated and tired of this; what am I supposed to tell the client?! I'm using the latest and very expensive RL-ARM without any results whatsoever. Can you share your experience with me? Thank you for your attention, Tamir |
|
Read-Only Author Per Westermark Posted 25-Jun-2009 20:50 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Per Westermark How do you know that the task isn't using a mutex - possibly a mutex owned by the CRTL? What happens if you run older versions of RTX and compiler/CRTL? |
|
Read-Only Author tamir michael Posted 26-Jun-2009 03:35 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? tamir michael Hello Per. This is the deal: No task ever locks more than one mutex at a time, so there is no danger of a deadlock - all locks are very short and cannot hang. I have been having this problem for a very long time (months...), even before I started using any synchronization elements. Are you using RTX for a product that needs to run that long? I will have to check whether the task that I mentioned locks a mutex - I removed all of them from the program (except the ones in USB, but that is not used at all when it goes wrong and there is no interaction between the tasks). Yesterday I managed to make the system hang upon startup by replacing an 'os_evt_wait_and' with a polling loop (you see how desperate I am?). OK, I'm not giving up my time slice, but why would that hang a system without any synchronization?! And how wonderful of the RTX kernel peripheral that it cannot tell you: 1. what mutex the system is waiting for (anything will do!) 2. the value of the PC per task. I'm telling you, if I cannot find the cause - RTX is going bye bye! |
|
Read-Only Author David Rose Posted 26-Jun-2009 06:44 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? David Rose I'm currently developing a product based on the STR9, so the code is still in development. It is being built with version 3.50 - I have not yet upgraded to 3.70 because we have a problem with the Ethernet driver that Keil are investigating. The main point is, this product will be required to run 24/7 so it needs to be reliable. Whenever I have to move across to something else, I always try to leave the development board running the latest code. The longest period over which I have left it running is 27 days - No problems were encountered. On one occasion I did have something strange happen with a task misbehaving. It ended up being my fault. An incorrectly initialised pointer (in some test code) was causing RTX data to become corrupt. |
|
Read-Only Author Per Westermark Posted 26-Jun-2009 08:19 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Per Westermark I'm not at work right now, but I'm quite sure that the RTX-based project is using 3.40. The devices must run 24x7 basically indefinitely. Because of the size of the test systems, I have a limited number of them available. The longest run (before I have had to step software or do other work) is maybe three months. No device has done a restart during the past 18 months unless I requested it or the unit was power-cycled. A note here is that these units have LPC23xx processors (not Cortex-M3). The CRTL and RTX code is using the modes Keil compiled them in. All my own code is using the full ARM instruction set, and not the Thumb subset. Didn't you say earlier in some thread that your product has a Cortex-M3 processor? There must be quite a number of changes in the RTX code to adapt it to the interrupt-controller changes. A change between Thumb and Thumb2 could potentially also change the instruction mix even for identical code. |
|
Read-Only Author tamir michael Posted 26-Jun-2009 08:31 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? tamir michael Per, Thanks for your reply. We are using a LPC2478. It is absolutely frustrating, but I must accept that I am doing something wrong or maybe it is a hardware issue (memory, timing...). I will strip a controller of just about the most elementary stuff to try to pin-point the problem... |
|
Read-Only Author Stuart Wright Posted 26-Jun-2009 09:26 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Stuart Wright Tamir, That was what I had to do to find the mutex deadlock between the ARM libraries and the FlashFS. It's even harder to find if it is an infrequent deadlock. It's very frustrating when the OS and its libraries cause a deadlock! Definitely not what you expect of a real-time OS. |
|
Read-Only Author Per Westermark Posted 26-Jun-2009 13:39 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Per Westermark Just as Stuart Wright notes, I was wondering if your thread could have got stuck on a mutex inside the ARM library even if your own code doesn't make use of any mutex. How critical is a reboot? Do you have any inter-process supervision that could watchdog-reset your board if any thread gets stuck? Do you have long boot times, or lose important synchronization information that takes time (or is impossible) to recollect? I would recommend that you have a quite aggressive watchdog timeout, and design all threads so that they get woken up regularly even if you have no work for them. Whenever they wake up, they should then sign off that they are alive. At the same time, they should compute how long it has been since they last got useful work to do, and decide if they are unhappy either with a thread they expect data from, or with a thread that was expected to consume the data they produced. In the end, the watchdog shouldn't be kicked unless all threads and (when applicable) interrupt handlers are alive and working. Having extra dummy events in the system may possibly increase the processor load by 1%, but it is often very critical to notice and react when the application is only partially working. |
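
The supervision scheme described above can be sketched in plain C. This is only a minimal illustration under stated assumptions, not RTX code: `NUM_TASKS`, `task_report_alive`, `supervisor_check` and `watchdog_kick` are hypothetical names, and `watchdog_kick` merely counts calls where a real board would run its hardware feed sequence. On a real target the read-modify-write of `alive_mask` would also need protection against preemption.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3u                        /* hypothetical number of supervised tasks */
#define ALL_ALIVE ((1u << NUM_TASKS) - 1u)  /* one bit per supervised task */

volatile uint32_t alive_mask;   /* bits set by tasks since the last supervisor check */
unsigned watchdog_kicks;        /* stand-in counter for the real hardware feed */

/* Placeholder for the board-specific watchdog feed (hypothetical). */
void watchdog_kick(void)
{
    watchdog_kicks++;
}

/* Each task calls this every time it wakes up, even when it has no work. */
void task_report_alive(unsigned task_id)
{
    if (task_id < NUM_TASKS) {
        alive_mask |= (1u << task_id);
    }
}

/* Called periodically by a supervisor task. The watchdog is fed only if
 * every task has signed off since the previous check. Returns true if fed. */
bool supervisor_check(void)
{
    bool all_ok = (alive_mask == ALL_ALIVE);
    if (all_ok) {
        watchdog_kick();
    }
    alive_mask = 0u;   /* every task must sign off again before the next feed */
    return all_ok;
}
```

If a single task stops reporting, `supervisor_check` stops feeding the watchdog and the hardware timeout resets the board, which is the "partially working" case the post is concerned with.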
|
Read-Only Author tamir michael Posted 26-Jun-2009 13:56 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? tamir michael Per, I have implemented most of your recommendations already. I have been able to induce a failure much faster by bombarding the controller with UART trash, and it seems as if the culprit is a chunk of code I added yesterday (at least this failure scenario is solved, I hope, but it does not explain the failures from before I wrote it!). Franc Urbank (the RTX guru) is trying to help, too. I will keep you informed, as we'll know more after the weekend. |
|
Read-Only Author Uli Behrenbeck Posted 26-Jun-2009 17:59 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Uli Behrenbeck Tamir, your problem sounds really serious. Crossing all available fingers that you'll catch it. I am quite new to ARM & Keil, but have been in the embedded business for quite a while. My impression is that RTX in general lacks all the things that give "comfort" as debug aids. Other OSes offer a lot more compared to Keil, e.g. Segger's embOS. One of the first things I had to do was to write a sort of schedule monitor to see what task was consuming what time. I found that the bare basics were there (the rt_agent_xx stuff), but it was basic and overall nonfunctional. So quite a bit was left to code. Then I wrote myself a helper to capture all exception-relevant data ... and so on. All of that is - in my opinion - stuff that should have been in the box, at least at that price. What I'd do: a) Is there a chance that the mutexes might get modified accidentally by sick pointers? What are your application (and your coding style) like? On "good" days I manage really weird things ;))) Do you have the chance to compile on a different machine (e.g. Visual Studio at highest warning levels) to get rid of the chance of such a problem? b) Simply modify the OS. If you think it is a proper call that changes the mutex, why not simply track all mutex accesses (for a limited time, of course)? If you don't have that many accesses, maybe you can put them all into RAM, let the machine run a while and dump them after the next reset (but be careful to put them in an area that is not cleared at startup). Hope these ideas were of some help. G O O D L U C K !! Uli |
|
Read-Only Author Uli Behrenbeck Posted 26-Jun-2009 18:44 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Uli Behrenbeck just to go in detail with b) if you can simply get the mutex to change to a wrong value (as you mentioned above), try to track all intended mutex changes with their corresponding task ids. Maybe you can then detect, that there is no "set_wrong" at all or you can detect the failing task. even more luck ;) ULI |
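
One way to read suggestion b) is a shadow table that records every intended mutex change together with the task id, so that it can later be compared with what the kernel believes. The sketch below is a hosted C illustration under assumptions: `NUM_MUTEXES`, `NO_OWNER` and all `shadow_*` names are hypothetical, and how the "actual" owner is read out of the kernel is not shown because it depends on RTX internals.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MUTEXES 4u     /* hypothetical number of tracked mutexes */
#define NO_OWNER    0xFFu  /* marker for "not locked" */

uint8_t shadow_owner[NUM_MUTEXES];  /* task id that intentionally holds each mutex */

/* Mark every tracked mutex as free. */
void shadow_init(void)
{
    for (unsigned i = 0; i < NUM_MUTEXES; i++) {
        shadow_owner[i] = NO_OWNER;
    }
}

/* Call right after a task has successfully acquired a mutex. */
void shadow_on_acquire(unsigned mutex_id, uint8_t task_id)
{
    if (mutex_id < NUM_MUTEXES) {
        shadow_owner[mutex_id] = task_id;
    }
}

/* Call right before a task releases a mutex.
 * Returns false if the releasing task is not the recorded owner. */
bool shadow_on_release(unsigned mutex_id, uint8_t task_id)
{
    if (mutex_id >= NUM_MUTEXES || shadow_owner[mutex_id] != task_id) {
        return false;
    }
    shadow_owner[mutex_id] = NO_OWNER;
    return true;
}

/* Compare the shadow record with the owner the kernel reports.
 * A mismatch means the mutex changed without an intended access. */
bool shadow_matches(unsigned mutex_id, uint8_t actual_owner)
{
    return mutex_id < NUM_MUTEXES && shadow_owner[mutex_id] == actual_owner;
}
```

If the kernel's view and the shadow table disagree although no `shadow_on_*` call was made in between, the change did not come from an intended access, which points towards corruption; a failed `shadow_on_release` instead identifies the offending task.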
|
Read-Only Author tamir michael Posted 26-Jun-2009 19:25 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? tamir michael Uli, Thanks for your reply. I made it home in a collapsed state :-) Here is the deal: the system kept on crashing even after removing all mutexes. I did find 2 problems, though: 1. There was a path in the program that attempted to lock a mutex while in interrupt context. A little assembly magic solved that. 2. There was a problem in the communication task that could have locked the direction of the RS-485. I don't fully understand why, but the code is now more solid. The controllers are now running on my desk. If they don't hang by Monday, I think I can consider the situation an improvement, even though the system used to crash before I started using mutexes etc. and previously ran for over a week without a problem (only lately did it become so unstable). I will post what happened. |
|
Read-Only Author tamir michael Posted 26-Jun-2009 19:32 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? tamir michael what I really miss in uv3/4/RTX combination: * LR per task * locked system resources per task * more aid to debug but the greatest benefit would probably be something my processor (LPC2478) does not have: a MMU... |
|
Read-Only Author Tamir Michael Posted 28-Jun-2009 06:24 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael Here is an idea to improve RTX: Offer a debug mode binary / compile time macro that causes the system to guard critical data with a checksum that is being re-calculated every time the kernel is active. when a program overwrites something thus altering the checksum, the processor calls a callback and hangs, providing the history of the last 100 milliseconds of operation in terms of tasks that ran and interrupts that occurred. |
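
The checksum idea above can be illustrated with a small hosted C sketch. Everything here is an assumption made for illustration: `kernel_state_t` is a hypothetical stand-in for the real kernel data, `guard_seal` and `guard_check` are invented names, and the checksum is a simple multiplicative byte hash rather than anything RTX provides.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's critical data. */
typedef struct {
    uint32_t ready_mask;
    uint32_t current_task;
    uint32_t tick_count;
} kernel_state_t;

kernel_state_t kernel_state;
uint32_t kernel_state_sum;      /* checksum stored after the last legitimate update */
unsigned corruption_reports;    /* stand-in for the "callback and hang" action */

/* Simple multiplicative byte hash (not a CRC); any single changed byte alters it. */
static uint32_t checksum(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0u;
    for (size_t i = 0; i < len; i++) {
        h = h * 31u + p[i];
    }
    return h;
}

/* The kernel calls this after every legitimate change to its own data. */
void guard_seal(void)
{
    kernel_state_sum = checksum(&kernel_state, sizeof kernel_state);
}

/* The kernel calls this on every entry. Returns false and reports
 * if the data changed without going through guard_seal(). */
bool guard_check(void)
{
    if (checksum(&kernel_state, sizeof kernel_state) != kernel_state_sum) {
        corruption_reports++;
        return false;
    }
    return true;
}
```

Note that detection only happens at the next `guard_check` call, so the write that caused the damage may already be several task switches in the past.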
|
Read-Only Author tamir michael Posted 29-Jun-2009 09:10 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? tamir michael Something else that can help a lot is the kernel causing the processor to immediately go to abort mode, if an attempt is made to use anything that must not be used in any exception mode from anything but user mode. |
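
A check of this kind boils down to inspecting the mode field of the ARM CPSR. The sketch below only decodes a CPSR value that is passed in; reading the register on the target is compiler-specific and not shown. The function name `cpsr_is_task_mode` is hypothetical, and treating both User and System mode as "task context" is an assumption that depends on how the kernel is configured.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mode field (bits 4:0) of the CPSR on classic ARM cores such as ARM7TDMI. */
#define CPSR_MODE_MASK 0x1Fu
#define CPSR_MODE_USR  0x10u
#define CPSR_MODE_FIQ  0x11u
#define CPSR_MODE_IRQ  0x12u
#define CPSR_MODE_SVC  0x13u
#define CPSR_MODE_ABT  0x17u
#define CPSR_MODE_UND  0x1Bu
#define CPSR_MODE_SYS  0x1Fu

/* Returns true if the given CPSR value describes ordinary task context.
 * Assumption: tasks run in User or System mode; every other mode is
 * treated as an exception context in which blocking calls are forbidden. */
bool cpsr_is_task_mode(uint32_t cpsr)
{
    uint32_t mode = cpsr & CPSR_MODE_MASK;
    return mode == CPSR_MODE_USR || mode == CPSR_MODE_SYS;
}
```

A debug build could call such a predicate at the top of every blocking kernel service and trap at once when it returns false, instead of continuing with inconsistent state.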
|
Read-Only Author David Rose Posted 30-Jun-2009 15:31 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? David Rose Tamir, Just out of interest, have you found the problem? I see in: http://www.keil.com/forum/docs/thread15089.asp you talk about a wild pointer. Was it that? |
|
Read-Only Author tamir michael Posted 30-Jun-2009 15:43 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? tamir michael hell no, there was an erroneous path in the program that: 1. tried to lock a mutex during an exception. RTX hates that for obvious reasons (fixed by interrogating the processor mode; no guard against IRQs is needed during an IRQ). I added that mutex to guard the file system as I knew that the function can be accessed in "parallel". I only forgot that one of the options is access while in IRQ mode...! 2. tried to access the SD card during an exception. RTX hates that, too (it is now moved into the RTX application itself with a circular buffer. It was in the pipeline for months, no time to implement...!). but my questions remain: why does RTX not die IMMEDIATELY when such a violation occurs? why does RTX not have a checksum to guard against wild pointers corrupting the kernel data? and why is it allowed to run for up to 3 days when case 1 happens?!?!?! this was a close call, but at least we learned something... |
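
The circular-buffer fix mentioned in point 2 follows a common pattern: the interrupt handler only queues data, and a task later drains the queue and performs the slow SD card access. A minimal single-producer, single-consumer sketch in C follows; the names `ring_t`, `ring_put`, `ring_get` and the size `RING_SIZE` are hypothetical and not taken from the poster's code. Using `volatile` indices with one writer each is assumed to be sufficient on a single-core ARM7; other architectures may need memory barriers.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 16u   /* hypothetical capacity; must be a power of two */

typedef struct {
    volatile uint32_t head;   /* written only by the producer (the ISR) */
    volatile uint32_t tail;   /* written only by the consumer (the task) */
    uint8_t data[RING_SIZE];
} ring_t;

/* Called from the interrupt handler: never blocks, never touches the SD card.
 * Returns false if the buffer is full and the byte was dropped. */
bool ring_put(ring_t *r, uint8_t byte)
{
    if (r->head - r->tail == RING_SIZE) {
        return false;
    }
    r->data[r->head % RING_SIZE] = byte;
    r->head++;
    return true;
}

/* Called from the task that owns the file system.
 * Returns false if there is nothing to fetch. */
bool ring_get(ring_t *r, uint8_t *out)
{
    if (r->head == r->tail) {
        return false;
    }
    *out = r->data[r->tail % RING_SIZE];
    r->tail++;
    return true;
}
```

Because the handler never calls into the file system or takes a mutex, both violations described in the post are avoided by construction.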
|
Read-Only Author Hans-Bernhard Broeker Posted 30-Jun-2009 22:19 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Hans-Bernhard Broeker why does RTX not die IMMEDIATELY when such a violation occurs? Because customers (including you) hate it when OS kernels "waste" time on such "superfluous" checking. Do you have any idea what it would cost you, in terms of interrupt latency or task switch time, if the kernel checked all its data every time before performing the job you asked of it? |
|
Read-Only Author tamir michael Posted 30-Jun-2009 22:26 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? tamir michael Hans, the answer to all of your questions is "yes". I forgot to mention here (but not in my official mail to Keil) that I would like to see such a debug mode for development purposes only. |
|
Read-Only Author Hans-Bernhard Broeker Posted 30-Jun-2009 23:37 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Hans-Bernhard Broeker I would like to see such a debug mode for development purposes only That would disrupt the integrity of the program under debugging, and thereby invalidate the result. It's entirely possible that introducing such checks in a debug version of the program not only makes the bug go away (e.g. because it depended on timing details of the production version), but also causes the program to develop even worse ones (violated timing requirements, stack overflow, ...). As people in aerospace put it: debug what you fly, and fly what you debugged. |
|
Read-Only Author Stephen Smyth Posted 2-Jul-2009 10:59 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Stephen Smyth "why does RTX not die IMMEDIATELY when such a violation occurs?" <BeginAngryRant> OMG! Is that supposed to be a serious question? How could that occur unless you have some hardware to protect you from such things? Have you considered using a part with MMU? If Keil did include such a library, and it were to be used, then the rogue pointer might just access some other part of the system, maybe not the RTX data. What would you ask for then? For Keil to provide checksums over user application space? <EndAngryRant> That's the end of my 2 cents :) |
|
Read-Only Author Tamir Michael Posted 2-Jul-2009 11:05 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael Stephen, OMG! Is that supposed to be a serious question? I was dead serious, and still am. Have you considered using a part with MMU? too late for that now. If Keil did include such a library, and it were to be used, then the rogue pointer might just access some other part of the system, maybe not the RTX data. What would you ask for then? For Keil to provide checksums over user application space? come on. I was specifically talking about a debug mode intended to verify that a mission critical element is not harmed by application software in the absence of hardware protection. I am fully aware of the impact of it. Had you spent the past weeks desperately looking for a failure that causes RTX to fail at absolutely arbitrary moments without a processor exception of any kind, you would not have quoted me using "<xxxAngryRant>" tags... |
|
Read-Only Author Stephen Smyth Posted 2-Jul-2009 11:21 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Stephen Smyth "I was dead serious, and still am." Hmmm. You didn't answer the one about how you would expect Keil to carry out this 'immediate' magic. As I implied, putting the extra code into the RTX would possibly just end up hiding the problem. So its value would be pretty limited. Wouldn't it? It could also give a nasty false sense of security. Like "The RTX isn't throwing an error, therefore my code must be right". Have you never been hit by the situation where an application would fail, but trying a build with the debug libraries (with the aim of narrowing down the problem) would cause the application to work again? |
|
Read-Only Author Tamir Michael Posted 2-Jul-2009 11:39 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael You didn't answer the one about how you would expect Keil to carry out this 'immediate' magic. I truly don't see the problem, given knowledge of the underlying chip. one of the issues above is independent of any hardware consideration, as surely you can see. having done a similar thing in an OS I have written for an STR9, I know with complete certainty that it is possible. It could also give a nasty false sense of security. Like "The RTX isn't throwing an error, therefore my code must be right". but you can apply this logic to so many other factors that influence a system. the point is this: a silent RTX means nothing at all. But a failing RTX means that you most definitely do something wrong! another handy feature could be a recording of the last, say, 100 ms in terms of which tasks ran and what interrupts occurred (with a timestamp). how much RAM does that cost? how much time does it save if the system fails and you have a post-mortem log? |
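
The recording proposed above can be sketched as a small overwrite-oldest trace buffer. This is a hosted C illustration only: `TRACE_DEPTH`, `trace_record` and `trace_dump` are hypothetical names, and the caller is assumed to supply the timestamp from whatever tick source the system has. With the layout below each event occupies 8 bytes on typical 32-bit ARM compilers, so the RAM cost is roughly 8 bytes times the chosen depth.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 4u   /* hypothetical depth; a real log would be much deeper */

typedef enum { TRACE_TASK_SWITCH = 1, TRACE_IRQ = 2 } trace_kind_t;

typedef struct {
    uint32_t timestamp;  /* supplied by the caller, e.g. a timer tick count */
    uint8_t  kind;       /* one of trace_kind_t */
    uint8_t  id;         /* task id or interrupt number */
} trace_event_t;

trace_event_t trace_buf[TRACE_DEPTH];
uint32_t trace_count;    /* total number of events ever recorded */

/* Record one event, overwriting the oldest entry once the buffer is full. */
void trace_record(trace_kind_t kind, uint8_t id, uint32_t timestamp)
{
    trace_event_t *e = &trace_buf[trace_count % TRACE_DEPTH];
    e->timestamp = timestamp;
    e->kind = (uint8_t)kind;
    e->id = id;
    trace_count++;
}

/* Copy the retained events, oldest first, into out[]. Returns how many were copied. */
size_t trace_dump(trace_event_t *out, size_t max)
{
    uint32_t stored = (trace_count < TRACE_DEPTH) ? trace_count : TRACE_DEPTH;
    uint32_t first = trace_count - stored;
    size_t n = 0;
    for (uint32_t i = 0; i < stored && n < max; i++) {
        out[n++] = trace_buf[(first + i) % TRACE_DEPTH];
    }
    return n;
}
```

For the log to survive a reset it would have to live in RAM that the startup code does not clear, and, as later replies point out, the log itself is not protected against a wild pointer.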
|
Read-Only Author Stephen Smyth Posted 2-Jul-2009 12:01 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Stephen Smyth "the point is this: a silent RTX means nothing at all. But a failing RTX means that you most definitely do something wrong!" Absolutely - It's like the maxim, you can prove something doesn't work, but not that it works 100%. Unless, of course, you believe the work of statisticians. Unfortunately - Corrupt pointers are not normally considerate enough to scribble over the things you want them to scribble over. I doubt very much that you can detect this corruption IMMEDIATELY as you suggest. On something like an STR9 there would surely always be a delay. There would probably have to be a check at the next timeslice or OS call. A lot can happen during those delays. Oh ... And if you have this extra code and data for the purposes of checking execution history, you'd better protect that region as well from invalid pointer corruption. Consider an invalid pointer being used as an argument for a call to memset - Whoosh, lots of trashed data! |
|
Read-Only Author Tamir Michael Posted 2-Jul-2009 12:05 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael Stephen, "immediately" was certainly an inappropriate term to be used here. |
|
Read-Only Author Hans-Bernhard Broeker Posted 2-Jul-2009 21:43 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Hans-Bernhard Broeker I was specifically talking about a debug mode intended to verify that a mission critical element is not harmed by application software in the absence of hardware protection. As you've been told quite a number of times now, that is utterly impossible. No amount of testing can ever verify anything. And a debug version that's not identical to the real program can prove even less than that. The fact that a suspected wild pointer doesn't hit the supervised, critical data of the debug version doesn't mean anything at all for the critical data of the release version. The critical data may be in a different place, or there may be more of it (to implement all that testing), or the wild pointer may point elsewhere. You're chasing a unicorn. |
|
Read-Only Author Tamir Michael Posted 3-Jul-2009 08:09 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael Hans, Thanks for your input. I respect and fully understand everyone's comments, but unicorn or no unicorn, I am going to try it because I still think it has a good chance of working, given the following restriction: * The OS data must be positioned at a predefined location (easy to do with a scatter file), preferably at the beginning of (external) RAM (if possible). this eliminates many possibilities by preventing critical data regions from mingling with other data. |
|
Read-Only Author Hans-Bernhard Broeker Posted 3-Jul-2009 17:28 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Hans-Bernhard Broeker eliminates many possibilities The problem is that "many" just is not enough. |
|
Read-Only Author Per Westermark Posted 3-Jul-2009 17:38 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Per Westermark Not good enough means that a test can't prove correctness. But a test that has an x% probability of pinpointing the location of an error can still be meaningful. The big problem here is to estimate how large the percentage would be, i.e. the gain in relation to the cost. The thing that is important to note is that checksummed data structures don't lead to correct programs. Checksumming is only a way to _maybe_ detect corruption. In this case, checksumming could possibly tell what task was running during the corruption. And if all ISRs set a flag, then checksumming could possibly add a list of potential ISRs to look closer at. But checksumming would possibly point at the wrong task, in case the memory corruption is caused by a DMA transfer, started by another thread but creating the corruption after a task switch. |
|
Read-Only Author cactus blip Posted 3-Jul-2009 18:14 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? cactus blip But checksumming would possibly point at the wrong task, in case the memory corruption is caused by a DMA transfer, started by another thread but creating the corruption after a task switch. ouch, you are so right. I overlooked that one...! |
|
Read-Only Author Tamir Michael Posted 3-Jul-2009 19:04 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael Per, The point you made about the DMA transfers is indeed an issue. I never meant this to be anything more than a possible little help in case things are that much out of control (believe me, they were until a couple of days ago - nervous clients, nervous boss, nervous keyboard...). I don't think Keil are going to do this with RTX (there are other, more pressing issues...) - let's leave it as an intellectual exercise. |
|
Read-Only Author Per Westermark Posted 3-Jul-2009 19:25 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Per Westermark I regularly look at checksumming as one of the available tools to detect problems, but prefer to use it in situations where it can be included in the release build. Just as previously mentioned, it is best to test the same build that is expected to ship. It is enough to change a single byte in RAM or flash to make the debug build pass all tests (even if buggy) while the release build will fail - possibly in a routine the customer will only trigger once every three months. The reason I posted was that Hans-Bernhard Broeker's post was aimed at pointing out that checksumming can't validate something as correct. But that is a separate issue from using it as a tool to detect something broken. A bigger issue with checksumming (at least when used in release builds) is to decide what action to perform in case of a checksum error. Auto-repair, reboot, deadlock, warn, ... |
|
Read-Only Author yann suisini Posted 9-Jul-2009 09:24 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? yann suisini I read this topic with a lot of interest, as it reminds me of the lack of debug support in RTX. Statistical data about % task execution times, the state of a mutex, the number of free memory blocks, the number of free semaphores, etc. could be a VERY interesting improvement for the RTX library !!! |
|
Read-Only Author Justa Thought Posted 11-Jul-2009 11:28 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Justa Thought I am using RTX also. I was running a test over the weekend and came in to find the system had died for no apparent reason. I ran it overnight again and again and sometimes it would be running and other times it had died (basically running just a single task and ISR in this case). To make a long story short it was the ULINK debugger I left connected to the board. The damn thing cannot remain connected to the PC via USB when running overnight tests, even though it was not being used and my PC was off. The USB was unplugged from the ULINK and the 'problem' disappeared. (BTW: ULINK2 must be completely removed from the board). |
|
Read-Only Author Tamir Michael Posted 11-Jul-2009 15:27 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael hmm, I don't think this phenomenon has anything to do with RTX itself - we do keep a ULINK2 connected without a problem. are you absolutely SURE that the problem has disappeared? some systems here ran for a week without a problem, others died after 1 or 2 days. |
|
Read-Only Author Justa Thought Posted 12-Jul-2009 11:51 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Justa Thought Well, the only test I was running was with an SSP as SPI Master interfacing to a single slave and exchanging an identical command/request sequence using the Modbus ASCII protocol. No changes were made to the code when I removed the USB connection from the ULINK. It hasn't died since I did this. The frames exchanged were upwards of 2.7 million after a successful weekend run. Before the ULINK was removed I was consistently failing at a fraction of the frames reported. I cannot say that this is your problem, but it is another angle you need to consider... |
|
Read-Only Author Tamir Michael Posted 12-Jul-2009 13:17 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael As I noted above, my issue is resolved already. It was all about addressing FlashFS's SD card driver from exception mode, as well as a path that locked a mutex from exception mode. Once these were removed, no more hangups were experienced. |
|
Read-Only Author Xiao B Posted 8-Sep-2009 16:07 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Xiao B Hi all, I am using the At91sam7s128 with MDK 3.20 and ULINK2 and am having a similar problem. My RTX clock interval is set to 1 ms and the UART task has a higher priority (2) than the other tasks. I have 64 boards running exactly the same firmware. A PC is always polling these 64 boards for data. Randomly, one or two boards will stop running after a random time (from hours to days). However, some boards have been running for weeks without any problems. I caught this problem once with the debugger and found that it stops in os_idle_demon and will not switch to other tasks that are ready for service. Regards, Xiao |
|
Read-Only Author Tamir Michael Posted 8-Sep-2009 16:37 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael See here: http://www.keil.com/forum/docs/thread15346.asp |
|
Read-Only Author Xiao B Posted 9-Sep-2009 13:30 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Xiao B Hello, Thank you for the link. It gave me a lot of information. Is there any workaround for this problem right now? Do we have to pay an extra couple of grand for the new release? Regards, Xiao |
|
Read-Only Author Tamir Michael Posted 9-Sep-2009 13:50 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael Keil are working on a new release of RL-ARM. If you have a license, it's for free. I do have a prototype fix which seems to work fine, but I do not think Keil will appreciate me distributing it right now. Wait a little longer for an official release. |
|
Read-Only Author Xiao B Posted 9-Sep-2009 14:17 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Xiao B Hi, I got confused about this license. I do have a license. The free upgrade is only good for one year from the purchase, am I right? I've been using this software for more than one year. Regards, Xiao |
|
Read-Only Author Tamir Michael Posted 9-Sep-2009 14:25 GMT Toolset ARM |  RE: Are you using RTX? Can you help me? Tamir Michael this is a question for Keil support, I am afraid. |
|