Brief description of the setup: We have a mobile platform where the OEM ECU and various sensors are connected to an X90 PLC. The sensor data and setpoints are exchanged via POWERLINK with an APC910. On the APC, a hypervisor system with AR and a Debian GPOS is installed, exchanging data via exOS.
We regularly, but not always, see two different exceptions that put the APC runtime into SERV mode.
But before I start to change the setup: What could be the reason that these exceptions sometimes aren’t triggered for a whole week, and sometimes happen every hour? I was not able to reliably trigger this event for troubleshooting. There is no logic in the PLC code of the APC; it is just a transparent data link, so to speak.
The idle task on the APC shows 59% CPU usage.
The APC uses the system timer as clock source, while the X90 uses PLK as clock source. Can it be that the PLK communication somehow interferes?
Is it possible that anything on the Debian system can cause these “outliers”? Network, CPU, memory, disk usage?
The attached stack trace of the cycle time violation indicates an issue inside the exOS library.
Sadly, I heard that exOS is discontinued and not even listed on br-automation.com anymore, so I don’t expect any support there.
If we were to fully refactor the system, is there a best-practice configuration for exchanging data with a Linux system?
I don’t have any experience with exOS or deeper knowledge of the hypervisor system.
But at least the data I see looks a bit similar to an issue I had many years ago (using ARwin, not the hypervisor).
That’s the reason why I want to share my thoughts:
The data in the backtrace points in the direction of increased Ethernet communication.
In a generic PCI bus architecture (a multi-master bus), the currently active bus master cannot be forced to stop its data transfer and release the bus.
That means that while one bus master has control, the other bus masters can’t transfer their data until the active one finishes its transfer. The AR interface cards, but also other PCI devices such as Ethernet interfaces, are such masters. And if the Ethernet interface blocks the bus for too long, the AR interface card cannot transfer the IO data within the configured cycle, which then leads to an IO scheduler cycle time violation.
And that is what happened in my case back then: under some circumstances the Ethernet interface had a high load peak (in my case, caused by remote desktop access to the system via VNC), and in rare cases the cyclic IO data transfer arrived too late.
As I said, it happened with a different system, but also one with hardware components shared between two operating systems on one machine, and it looked a bit similar.
So maybe it’s at least worth taking that information into consideration.
(In the past, our solution was to lower the Ethernet port speed from 1 GBit to 100 MBit via driver settings, but I can’t imagine that’s exactly the solution in your case, too.)
About communication between AR and GPOS:
there are two possibilities, depending on the required data transfer speed and volume:
using the virtual Ethernet interface between AR and GPOS for IP-based communication (OPC UA, or TCP/UDP)
using shared memory between AR and GPOS (in AR there are two libraries for that, ArIscShm and ArIscEvent; for the GPOS a C API is available; for details please check Automation Help, for example here: B&R Online Help)
I do have quite a lot of experience with exOS but I would need some more information.
It’s been a while, but it seems you’re using the Ethernet-based communication? → lxi_eth_client_process
If so, that implementation is not deterministic and was only designed for simulation purposes, i.e. using ArSim ←→ WSL.
It also seems you’re using Cyclic#5 for data exchange? So far, tests with exOS on hypervisor systems have only been performed on Cyclic#1 and Cyclic#2.
In general, as you write, since it’s discontinued and you just need basic data communication to Linux, the simplest approach for sure is to send datagrams (AsUDP): even though it’s connectionless, you have a simple way of controlling things like a heartbeat, which is hard to handle if you’re using TCP with lingering sockets and you pull the cable.
For shared memory, see if you can get access to the lower levels of exOS via lxi; it has some neat functions implemented to ensure consistent data transfer via ring buffers in the shared memory. I.e., I can provide that, you just need to ask the right person for permission.
Hi Patric, you helped us a lot in the beginning to set up the project. It worked well for some years, but these errors are occurring more and more often. This might be related to an increase in network traffic, as already mentioned.
The lower task classes with faster periods are not used in this project, so I can remove them. I was not aware that there is a semantic difference between the task classes apart from period and priority. I can change all of that, together with increasing the system timer. Before that, I just wanted a reliable test that triggers the exception, so I can evaluate any changes.
We developed the exOS component with the WSL simulation and just switched it over to the GPOS in the hypervisor, so it might be that something is not configured correctly. I don’t use any Ethernet communication except for what exOS is doing internally. According to the stack trace, lxi_eth_client_process is only called inside the ExData library?
Humm, yep, that certainly looks like shared memory! I think you’re good in that respect. Maybe the semTake etc. comes from the locking of resources between the ExApi function blocks and the rest of the data communication. In order to have consistent data, there are a few semaphores/critical sections between the DMR in TC#5 and the ExApi function blocks. If you aren’t already doing this, please place the function blocks in the same TC. That could potentially solve the problem, i.e. that a function block takes up a resource which locks the DMR but is then delayed for other reasons. The DMR runs alongside the cyclic system in whichever task class you configure it; it doesn’t have to be 1 or 2.
The DMR lives in TC#5, as can be seen in the screenshot above, and so does the exOS application.
The hottest candidate for me is still something blocking the hardware bus as the cause of the system tick exception, which might be solved by increasing the system tick and refactoring the task classes. But whether this also solves the TC#5 overrun of the exOS program, I’m not sure. TC#5 has a 10 ms period with an additional 10 ms tolerance, which means it does not finish within 20 ms, if I understand that correctly.
Yep. This could actually be a thing, as we found issues with the storage monitor influencing the Linux realtime behavior on the 4.91 versions (it was introduced before, but that’s where we really came across it). The feature is in a later version, I guess 4.92, but it could also be 4.93. So far I have only seen this issue affecting data latency (i.e. data arriving delayed). With the 2.1.1 version, which I think was released before it was discontinued, I also wrote a story in the help on how to have a fully synchronized system without hiccups, which also included switching off the storage monitor.

It could be this issue; again, I haven’t seen it crash like that before, and it could have to do with task class assignment. But the synchronization at the end of the task class is actually a blocking call, which so far has never taken longer than a few microseconds, because the Linux service is already waiting for the ping from AR on its end (so it’s really just a short sync handshake). Now, if the Linux process were locked due to a shared resource access, this could potentially lead to a cycle time violation. So it’s a likely candidate to appear as a ghost, because it really needs to hit at precisely the wrong time, meaning there could be weeks between failures. It’s just a working theory, but it would potentially make sense.
OK, now how to check this. I’ve seen different variations of the influence of storageMon, depending on which Debian system is used, how it is configured, and possibly also the size (and I guess type) of memory. The easiest way to see the influence is running cyclictest in Linux: if you run it on barebone Linux first, you get the system performance; if you run it under the hypervisor, you get acceptable results, but every now and again you’ll get spikes. It would be interesting to see how large those spikes are.
cyclictest -n -p 90 -i 1000
Otherwise, a newer version (4.93) with storageMon disabled, a smaller disk size, or changing priorities could be ways to work around the problem.
But we don’t know for sure that this is the issue. If you check the profiler from the crash, you could potentially see something regarding storageMon there; that would give us a hint. But again, you don’t always get that.
Let me know where you’re going with this, and we can take it from there.
I was able to at least somewhat reproduce the “failed system tick” exception: I use a ROS stack on the GPOS Linux side, and sending ROS actions periodically in a loop triggers the exception. Don’t ask me what the underlying DDS communication layer is doing on the network card to influence the timing.
I have now tried reconfiguring the system timer to 1 ms and moving everything into the lower task classes #1 and #2. Components/General/DMR task class is set to an existing TC that is faster than the exOS app. Now I get this funny memory access violation directly after reboot:
I think we have reached the limit of what the global community can help you with. You should contact your local support team; they can help you investigate more deeply. Please provide them with a link to this discussion as well, and it would be great if you updated us with any new findings.
This is with a typical load on the APC and a decent amount of ROS network traffic. That seems to be quite a lot, right? Max latency without any application running (also no exOS) is still about 500 µs.
Your latency doesn’t look alarming. What you show there are acceptable results, mostly a consequence of being second in line on a hypervisor system: AR will always have priority over the GPOS, so the 10-50 µs you see in standalone runs are not always achievable when the system needs to step aside for the RTOS. From all the tests I did, we can get very predictable results within ~2 ms with the RTOS running alongside, which is pretty good for most applications.

I cannot explain the crash you have, though. You can send me some of the logger files, then I might be able to say more once I can trace the stack position. Until then, my suggestion is (because that is what we used when testing): use TC#2 for the DMR, together with your application, and run it anywhere between 2 and 10 ms. Then I can check the loggers and see if I can pinpoint anything. Or just send the system dumps, then we also get the profiler. I cannot promise to spend days on this, but I can for sure have a look at it. My email is unchanged. Until then, cheers.