Troubleshooting Page Faults

1 Overview

1.1 What is a page fault?

The processor throws an exception to the operating system if it tries to access an invalid or protected memory location. The B&R operating system logs this type of serious memory violation as a page fault (error 25314).

1.2 What can cause a page fault?

Processor memory can become invalid as a result of programming errors such as:

  • Null or incorrect pointer
  • Division by zero
  • Accessing an array index that does not exist (e.g. the 11th element of a 10-element array)
  • Incorrectly copying memory from one location to another. For example, if you copy X bytes of data to a location where only (X – 50) bytes are free, then the 50 bytes of necessary information stored behind the destination are overwritten. The next time the overwritten memory is accessed, the data is invalid and the processor cannot successfully execute the command.
  • Etc.
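
The oversized-copy case in the list above can be avoided by comparing the copy length with the destination size before copying. The following is a minimal sketch in plain C; `safe_copy` is a hypothetical helper name chosen for illustration and is not part of any B&R library. The same kind of check can be placed in front of a memory copy call in Structured Text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): copies src_len bytes into dst
   only if they fit. Returns 0 on success, -1 if the copy was refused. */
int safe_copy(void *dst, size_t dst_size, const void *src, size_t src_len)
{
    if (dst == NULL || src == NULL) {
        return -1;   /* null pointer: refuse instead of faulting */
    }
    if (src_len > dst_size) {
        return -1;   /* copy would overwrite memory behind dst */
    }
    memcpy(dst, src, src_len);
    return 0;
}
```

A caller would pass `sizeof` of the destination variable as `dst_size`, so the limit follows the declaration automatically if the variable is resized later.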

2 Troubleshooting a Page Fault

The following steps can be used as a guide to troubleshoot the cause of a page fault. It is recommended to execute these steps in the provided order.

2.1 Check the Last Change

If the AS program was previously working without triggering a page fault, check the most recent change you made. If that change involved pointers, arrays, memory copies, string copies, or similar, then it is the likely cause of your page fault. Check to make sure you are not accessing an invalid element of an array. If you are doing any memory manipulation, make sure the size of the memory you are manipulating is correct.

If you are using Subversion, consider checking out the last version of the program in which you know the page fault did not occur and continuing development from there.

If you are unsure what you last changed or if reverting the last known changes did not solve the page fault, then move on to section 2.2.

2.2 Check the Backtrace

Open the logger in Automation Studio by going to Open → Logger. Make sure the System logger module is visible because error 25314 gets entered into the System log:

image

Sort by time. Scroll down in the log until you see error 25314. If you do not see error 25314, then refresh the logbook via the following icon:

image

Select the 25314 entry. Then select the Backtrace tab below:

In some cases, there will be lines in the backtrace with a green arrow next to them. If you double click on these lines, then AS may jump directly to the line of code that caused the page fault. For example, after double clicking on either of the two lines shown in the backtrace below, AS jumps straight to line 9 of the task called Test:

image

In this case, it is clear that the page fault was caused by an array index out of bounds. The ‘z’ array only has 10 elements, but the code writes to index 32766 by mistake.

If the information in the backtrace does not lead you to a specific line in the program, then move on to section 2.3.

2.3 Identify a Trigger for the Page Fault

From this point forward it will be extremely helpful if you can determine a specific trigger which causes the page fault in your application. Some examples of these triggers are:

  • Every time you navigate to page X in the visualization and click button Y, the page fault happens.
  • Every time you save a recipe, the page fault happens.
  • Ten minutes after running the machine in mode X, the page fault happens.
  • Etc.

If you do not know the trigger for your page fault then that’s okay, but the troubleshooting process will take longer because you will have to wait an arbitrary amount of time until the page fault occurs again. In either case, continue to section 2.4.

2.4 Add the IecCheck or AdvIecChk Library

Often the backtrace information does not lead you directly to the cause of the page fault. If you are unable to identify the programming error manually via sections 2.1 and 2.2, there are two libraries that can help track down the cause of the page fault: IecCheck and AdvIecChk.

2.4.1 IecCheck

The IecCheck library is provided with Automation Studio. This library checks for division by zero, null pointers, invalid array indexes, invalid enumeration range accesses, and invalid subranges. If you add the IecCheck library to the project, the library will enter a new error into the logbook (55555) which contains more specific information about the problem it found (such as CheckDivInt for an integer division by 0). The page fault itself will not be triggered because the IecCheck library catches the problem first and sends the PLC into service mode. As a result, you will no longer have access to the backtrace information of the page fault.

Note that if the cause of the page fault is not one of the situations that the IecCheck library checks for, then the page fault will still be entered in the logbook and the IecCheck library will not give you any additional information.
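
Conceptually, a check of this kind inspects the index before the array access takes place. The sketch below shows the idea in plain C. It is not the library's implementation: `check_bounds`, `log_violation`, `write_element`, and the choice to clamp the index and continue are assumptions made for illustration (as described above, the real library writes error 55555 and sends the PLC into service mode).

```c
#include <stdio.h>

int g_violations = 0;   /* number of detected violations (illustrative) */

/* Hypothetical stand-in for writing a logbook entry. */
void log_violation(const char *what, long value)
{
    g_violations++;
    fprintf(stderr, "%s: offending value %ld\n", what, value);
}

/* Returns an index that lies within [lower, upper]. An out-of-range
   index is reported and replaced by the nearest valid bound. */
long check_bounds(long index, long lower, long upper)
{
    if (index < lower) {
        log_violation("CheckBounds", index);
        return lower;
    }
    if (index > upper) {
        log_violation("CheckBounds", index);
        return upper;
    }
    return index;
}

/* Every write goes through the check first, so an invalid index
   never reaches the memory access itself. */
void write_element(int *array, long lower, long upper, long index, int value)
{
    array[check_bounds(index, lower, upper) - lower] = value;
}
```

With a 10-element array declared as [0…9], a write to index 32766 would be reported once and never touch memory outside the array.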

Here are the steps to use the IecCheck library:
  1. Add the IecCheck library from the toolbox to the Libraries package of your Logical View:

image
image

  2. Make sure that the IecCheck library exists in the “Library Objects” section of the Software Configuration (Physical View, right click on the CPU, select “Software”, scroll down to the “Library Objects” section).

image

  3. Rebuild the project and transfer to the PLC.

  4. Trigger the page fault.

  5. Once the PLC is in Service mode, check the logger (Open → Logger). Sort by time. Look for a 55555 entry. If this entry exists, then that means the IecCheck library found a problem and sent the PLC into service mode before the page fault was triggered.

The 55555 entry will tell you the type of error it caught, the name of the task in which it caught the error, and the corresponding task class. For example, in the screenshot below the IecCheck library caught an array out of bounds issue in the task Test from task class 4:

image

  6. Fix the programming error which was identified in step 5 and then re-test the code to make sure the page fault is solved.

2.4.2 AdvIecChk

The AdvIecChk library is a modified version of the IecCheck library. The AdvIecChk library performs the same functions as the IecCheck library, but in addition it provides the following details about the location of the page fault:

  • Cyclic task class of the last executed task
  • Last executed task name
  • Type of programming error
  • Variable values from the last executed line of code
  • Backtrace pointing to the last executed line of code

The AdvIecChk library is not provided with Automation Studio, but it is included in the zip file along with this document.

Here are the steps to use the AdvIecChk library:
  1. Add an “Existing Library” from the toolbox into the Libraries package of the Logical View.

image

  2. Navigate to the location of the provided AdvIecChk folder. Click Finish.

image

  3. Make sure the AdvIecChk library is present in the Library Objects section of the Software Configuration:

image

  4. Rebuild the project and transfer to the PLC.

  5. Trigger the page fault.

  6. Once the PLC is in Service mode, check the logger (Open → Logger). Sort by time. Unlike with the standard IecCheck library, with the AdvIecChk library the page fault entry (25314) will be present in the logger. If the AdvIecChk library found a problem, then immediately prior to the page fault you will see a 55555 entry (but in this case it will just be a warning).

The 55555 entry will provide the following information:

  • Cyclic task class of the task with the detected issue
  • Name of the task with the detected issue
  • Type of programming error
  • Minimum and maximum valid values of the variable that caused the fault
  • Actual value of the variable that caused the fault

For example, in the screenshot below the 55555 entry indicates that an attempt was made to access an invalid index (32766) in an array with the range [0…9] in task Test within task class 4:

(If the AdvIecChk library did not detect a problem, then the page fault will be by itself in the logger with no accompanying warning 55555.)

  7. Select the 25314 error and go to the Backtrace. Double click on the “FUNCTION START POSITION” line that has a green arrow and whose Module name matches the task name identified in the 55555 warning immediately prior to the page fault. For example:

When you double click this line, Automation Studio will take you to the exact line of code where the page fault was encountered.

  8. Fix the programming error which was identified in step 7 and then re-test the code to make sure the page fault is solved.

2.4.3 Additional Information Regarding IecCheck / AdvIecChk

Note that these libraries only help to troubleshoot page faults caused by programs written in IEC languages (Structured Text, Instruction List, Function Block Diagram, Ladder Diagram, Sequential Function Chart) plus Automation Basic. These libraries will not find page faults caused by programs written in C or C++.

These two libraries are capable of catching many types of IEC programming errors, but not every kind of error is detected. For example, consider the case that a pointer is pointing to an incorrect (but not invalid/protected) location in the memory. A memory copy operation using this pointer might end up overwriting some part of memory. When the processor tries to access this corrupted memory location later on, it will cause a page fault because the data in memory no longer makes sense to the processor. In this kind of situation, the memory has already been corrupted, so there is no way to tell what line of code caused the corruption.
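
The following plain C sketch illustrates why this kind of corruption is hard to trace. The `Recipe` type, its fields, and `load_name_unchecked` are hypothetical names invented for the example. The copy is deliberately buggy: it is meant to fill only `name`, but it uses the source length without comparing it to the size of `name`, so it silently overwrites the neighbouring field. No fault occurs at the time of the copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: next_index is stored directly behind name. */
typedef struct {
    char    name[8];      /* intended destination of the copy */
    int32_t next_index;   /* unrelated data, later used as an array index */
} Recipe;

/* Buggy on purpose: len is never compared against sizeof(r->name).
   The clamp to the record size only keeps this demonstration itself
   inside valid memory; it does not protect next_index. */
void load_name_unchecked(Recipe *r, const char *src, size_t len)
{
    if (len > sizeof(*r)) {
        len = sizeof(*r);
    }
    memcpy(r, src, len);   /* bytes beyond name[7] land in next_index */
}
```

If `next_index` is used to address an array some cycles later, the fault surfaces at that access, far away from the copy that actually caused it.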

You should not leave either of these libraries running on a production machine. They should only be added and utilized during active troubleshooting of a page fault. Once the page fault is solved, you should delete the library from the Logical View, rebuild and transfer to the PLC.

If your page fault did not trigger an entry in the logger from these libraries, then move on to section 2.5 to keep troubleshooting.

2.5 Systematically Disable Tasks

If you were not able to identify a trigger for your page fault in section 2.3, then skip to section 2.6.

If the IecCheck / AdvIecChk libraries did not catch the cause of the page fault, you can systematically disable tasks in order to narrow down which task is causing the problem.

Steps to disable tasks:
  1. Go to the Physical View.

  2. Right click on the CPU and select “Software”. This opens up the Software Configuration.

  3. Right click on the task you want to disable and select “Disable”. Afterwards, the task name will appear grayed out. To disable more than one task at a time, you can use Ctrl + Click to individually select each task, or you can use Shift + Click to select a consecutive block of tasks. Then right click and select “Disable”. Note that you cannot use the Shift + Click method across tasks in different cyclic task classes.

In general, here are the steps to approach this method:

  1. Disable half of the tasks in the software configuration.

  2. Transfer to the PLC.

  3. Try to trigger the page fault using the previously identified trigger.

a. If the page fault still happens, then you know the problem is contained in the half of your tasks which is still enabled. Disable half of the remaining tasks and repeat the process. Continue this iterative process until the page fault no longer occurs. At that point, you know the page fault is being caused by the tasks you most recently disabled.
b. If the page fault does not occur, then you know the problem is contained in the half of your tasks which is currently disabled. Re-enable half of these tasks and repeat the process. Continue this iterative process until the page fault occurs again. At that point, you know the page fault is being caused by the tasks you most recently re-enabled.
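
The halving procedure above is essentially a binary search over the task list. The C sketch below shows the logic under two assumptions: exactly one task is responsible, and the trigger reproduces the fault reliably. `fault_occurs` is a hypothetical stand-in for the manual step "transfer, run the trigger, and check whether error 25314 appears"; `simulated_fault` only exists so the sketch can be run on a PC.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if the page fault still occurs
   while only the candidate tasks first..last (inclusive) are enabled. */
typedef bool (*fault_test_fn)(size_t first, size_t last, void *ctx);

/* Narrows the candidate range by half per trial and returns the index
   of the faulty task. The caller guarantees task_count > 0. */
size_t find_faulty_task(size_t task_count, fault_test_fn fault_occurs, void *ctx)
{
    size_t lo = 0;
    size_t hi = task_count - 1;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (fault_occurs(lo, mid, ctx)) {
            hi = mid;       /* culprit is in the half that stayed enabled */
        } else {
            lo = mid + 1;   /* culprit is in the half that was disabled */
        }
    }
    return lo;
}

/* Simulation for demonstration: ctx points to the index of the culprit. */
bool simulated_fault(size_t first, size_t last, void *ctx)
{
    size_t culprit = *(size_t *)ctx;
    return culprit >= first && culprit <= last;
}
```

Because each trial halves the number of candidates, a configuration with 8 tasks needs 3 transfer-and-trigger rounds instead of up to 8.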

Once you identify the task which is causing the page fault, you will have to manually read through this task to identify the problem. You can also implement the method described in section 2.6 specifically on this task in order to help find the issue.

2.6 Utilize a Remanent Variable

This method can be used if a non-IEC language is causing the page fault (and therefore the IecCheck / AdvIecChk libraries do not apply) or if you have identified the problematic task in section 2.5 but cannot determine the specific problem within that task.

Here are the steps to utilize a remanent variable for the sake of troubleshooting a page fault:
  1. Declare a remanent variable by checking the “Retain” checkbox in the .var file. For example:

Whether you make this a global or local variable depends on whether you have narrowed down which task is causing the page fault.

For more information on remanent variables, refer to the AS Help: Programming → Variables and data type → Variables → Nonvolatile variables → Remanent variables

  2. Increase the CPU memory configuration to accommodate this new remanent variable (if necessary). To check whether this is necessary, build the project. If you get a build error related to remanent memory, then:

a. Go to the Physical View, right click on the CPU, and select “Configuration”.

b. Expand the “Memory configuration” section and all sub-sections within this section.

c. Make sure that a device has been selected for “Device for memory RemMem” and that the “RemMem memory size” is nonzero.

d. Make sure all of the configured memory sizes are greater than or equal to the used memory sizes.

Refer to the screenshot below.
For more information on memory configuration, refer to the AS Help: Programming → Editors → Configuration editors → Hardware configuration → CPU configuration → SG4 → CPU properties – Memory configuration

image

  3. Throughout your program, manually set the remanent test variable to incrementing values. For example:

If you have not identified which task is causing the page fault, then you should at the very least set this variable to a unique value at the top of each cyclic program.

  4. The next time the page fault occurs, check the value of this remanent variable via the Watch window. The value of this variable will help you to identify which chunk of code caused the page fault. For example, referring to the screenshot above, say the value of PageFaultTest was 2 once the PLC rebooted into service mode after the page fault. That means that the line of code which caused the page fault was somewhere between lines 30 and 33.

If you are unfamiliar with the Watch window, refer to the AS Help: Diagnostics and service → Diagnostics tools → Watch (variable monitor)
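
The checkpoint technique from section 2.6 can be sketched as follows. This is plain C for illustration only: on the PLC, `PageFaultTest` would be declared as a retained variable in the .var file rather than as an ordinary global, and the `step_*` functions are hypothetical placeholders for chunks of application code. A `false` return value stands in for the point where a real page fault would stop execution.

```c
#include <stdbool.h>

/* On the PLC this would be a remanent variable; here it is a plain
   global so that the sketch can be run on a PC. */
unsigned int PageFaultTest = 0;

/* Hypothetical chunks of application code. Returning false simulates
   the place where the page fault would interrupt the task. */
bool step_read_inputs(void)   { return true;  }
bool step_copy_recipe(void)   { return false; }   /* simulated fault */
bool step_write_outputs(void) { return true;  }

void cyclic(void)
{
    PageFaultTest = 1;               /* checkpoint before chunk 1 */
    if (!step_read_inputs()) return;

    PageFaultTest = 2;               /* checkpoint before chunk 2 */
    if (!step_copy_recipe()) return;

    PageFaultTest = 3;               /* checkpoint before chunk 3 */
    if (!step_write_outputs()) return;
}
```

In this simulation the variable is left at 2, which points to the code between the second and third checkpoint. Adding more checkpoints inside that chunk and repeating the test narrows the location further.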

Awesome, thank you Andrew!

Feedback: in this part is a broken picture, below

  1. Make sure that the IecCheck library.

Thank you for catching that! I have edited the post to correct the missing image.

Division by zero should not cause page faults. Otherwise an excellent article.

Here I want to add a small hint.
Use case: You found out that you have a problem (page fault) and added the IecCheck library. Now you want to find out exactly where the error occurred. The error is, for example, inside a library, and there are many instances of a function block in many tasks or even in the same task. In the logger you get the information about what caused the error (division by zero, array overflow, etc.), but you might not know exactly where the error came from (sometimes you don’t get a backtrace).

To find the exact position, you can also set a breakpoint in the function MakeEntry:

image

If the error now occurs, you can find some useful information in the debugger console, where you can see the calling order. Example:

Here you can see that the error occurred in the library AxLibM, in the file AxCtrl at line 442, in a function that was called from line 39 of the file Main.st in the task “ModuleFill”.

So this is my way to find out where an error came from.

Best Regards
Waldemar

This is one solution, which can be very useful. But if the error occurs directly after boot, you may not be able to even start the debugger.

As you said, the default implementation of IecCheck gives you neither the code position nor a backtrace. By adjusting the IecCheck library, you can get an exact backtrace, which makes finding the issue much easier. In the initial post, @andrew.kelley mentioned the AdvIecChk library, which seems to be a tuned version of the default implementation.

The trick lies in provoking a defined and clean page fault within the MakeEntry() function. With this trick you will get the clean backtrace.

Default implementation:
FUNCTION MakeEntry
	// build the message text
	status_name := ST_name(0,ADR(taskname),ADR(group));
	brsstrcpy(ADR(out_text),ADR(text));
	brsstrcat(ADR(out_text),ADR(' > in task > '));
	brsstrcat(ADR(out_text),ADR(taskname));
	MakeEntry := ADR(out_text);
	// write fatal error -> PLC will go to service -> no backtrace
	ERRxfatal(number,index,ADR(out_text));	
END_FUNCTION

Tuned implementation:
FUNCTION MakeEntry
	// build the message text
	status_name := ST_name(0,ADR(taskname),ADR(group));
	brsstrcpy(ADR(out_text),ADR(text));
	brsstrcat(ADR(out_text),ADR(' > in task > '));
	brsstrcat(ADR(out_text),ADR(taskname));
	MakeEntry := ADR(out_text);
	// write warning log -> PLC will stay in RUN
	ERRxwarning(number,index,ADR(out_text));
	// create 'defined' pagefault -> will bring the PLC to service with a backtrace
	brsmemset(0, 0, 1);
END_FUNCTION

It may seem unconventional to trigger a PageFault on purpose, but it is very effective in practical use. In the logger you will then see two messages as in the screenshots of @andrew.kelley. First there will be a warning which says e.g. 'CheckBounds > in task > Test' and then there will be the PageFault with the backtrace.

The AdvIECcheck from the initial post seems to have more fine tuning for a more detailed log message…

Hi!

Thanks for sharing so much information in this topic.
I’m quite interested in the AdvIecChk library mentioned by @andrew.kelley in his initial post. It looks like it adds a lot of very useful information to the error message. Is it available for download somewhere?

Hi Manfred,

I have attached the library here!

AdvIecChk.zip (5.2 KB)

Great!
Thanks a lot and have a nice weekend!

A post was split to a new topic: Why exception by zero is not triggered