Hi,
no, I don’t think the cluster IP can only communicate with one device: in the part of the Wireshark trace where Modbus TCP sessions are running, the cluster IP already communicates with two devices (.22 and .33) in parallel, so I’m fairly sure that’s not the problem.
And after some more analysis, I can see that both devices, .22 and .33, are running several sessions / streams in parallel.
The difference is (at least in the small snapshot of the running communication I can see in the trace) that device .22 sometimes does it the way I would expect, and the way TCP is meant to be used: it opens a stream (TCP SYN), queries some Modbus data, and then closes the connection (TCP FIN).
For example like here:
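Just to make that expected pattern explicit in code: a well-behaved client does roughly the following for each query cycle (a minimal sketch with a hypothetical host and register range; the MBAP framing itself is standard Modbus TCP):

```python
import socket
import struct

def build_read_holding_registers(tx_id: int, unit: int, start: int, count: int) -> bytes:
    """Build a Modbus TCP ADU: MBAP header + 'read holding registers' (0x03) PDU."""
    pdu = struct.pack(">BHH", 0x03, start, count)              # function, start addr, quantity
    mbap = struct.pack(">HHHB", tx_id, 0, len(pdu) + 1, unit)  # tx id, protocol 0, length, unit id
    return mbap + pdu

def query_once(host: str, port: int = 502) -> bytes:
    """Open (SYN), query, and cleanly close (FIN) - the pattern device .22 shows."""
    with socket.create_connection((host, port), timeout=2) as s:
        s.sendall(build_read_holding_registers(tx_id=1, unit=1, start=0, count=2))
        return s.recv(260)
    # leaving the 'with' block calls close(), which ends the stream with a FIN - no RST
```

Closing the socket after each query (or keeping one long-lived connection) is what keeps the server’s handle pool free; what matters is that the stream is terminated cleanly.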
But in some other streams I can see that this is not happening, and the PLC (.53) sends an RST after the stream has been idle for some time (no data flow at all on that stream).
As this always happened after 5 seconds of idle time, I assume it’s the “device idle timeout” configured in the PLC’s Modbus TCP stack. (That was also the situation in the screenshot in my post above, when .33 started communicating.)
For example here:
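If you want to check the 5-second hypothesis across the whole trace rather than single streams, you could export the packets (e.g. via Wireshark’s packet-dissection export; the column layout here is an assumption) and look for RSTs preceded by at least 5 s of silence on the same stream:

```python
# Each packet is a tuple (stream_id, timestamp_seconds, tcp_flags_string),
# e.g. exported from Wireshark with tcp.stream, frame.time_relative, tcp.flags.str.
def idle_before_rst(packets, idle_threshold=5.0):
    """Return stream ids where an RST arrives after >= idle_threshold s of silence."""
    last_seen = {}
    suspicious = []
    for stream, ts, flags in packets:
        if "RST" in flags and stream in last_seen and ts - last_seen[stream] >= idle_threshold:
            suspicious.append(stream)
        last_seen[stream] = ts
    return suspicious
```

If nearly every RST in the trace sits right at the 5 s mark, that would support the idle-timeout explanation.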
And: device .33 opens some other parallel streams, but doesn’t end those streams in the way I would expect (I haven’t seen a single TCP FIN in any of the .33-initiated streams!). Instead of ending the stream with a FIN, device .33 just sends an RST.
For example here:
All in all, because of the combination of everything mentioned above, I could imagine that the PLC runs into a resource bottleneck after some time while handling all of those parallel sessions, some of which aren’t closed the way TCP expects (the PLC has to track every one of these sessions through the TCP state machine, and the number of TCP handles inside a PLC is limited).
This bottleneck could lead to a situation where, at some point, all TCP handles of the PLC (or all handles reserved for server port 502) are blocked, which would then mean that no new connections are possible (until reboot).
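The handle-exhaustion idea can be illustrated with a toy model (purely illustrative numbers; the real handle limit and cleanup behavior are vendor-specific): if abandoned streams are never cleaned up, a handful of badly closed sessions is enough to permanently starve the pool.

```python
def simulate(handle_limit, sessions):
    """Toy model of a PLC's TCP handle pool.
    sessions: iterable of booleans - True if the client closes with a FIN,
    False if it abandons the stream (handle stays occupied, never freed).
    Returns how many later connection attempts get rejected."""
    leaked = 0
    rejected = 0
    for closes_cleanly in sessions:
        if leaked >= handle_limit:    # pool exhausted: SYN gets no usable handle
            rejected += 1
            continue
        if not closes_cleanly:        # no FIN: handle leaks
            leaked += 1
    return rejected
```

The 5 s idle timeout seen in the trace would mitigate this, but only if it actually fires for every abandoned stream; any stream that escapes cleanup accumulates.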
For me, the questions are:
- who is .33, and why doesn’t it close its sessions in the usual TCP way?
- why are parallel TCP streams from the same device querying the PLC server; is that necessary / reconfigurable?
- what has changed since it stopped working? As long as you haven’t changed software on the PLC and it ran for a year or so, I think there must have been some change in the network (introducing device .33, for example), in the firmware of the communication partners (have there been software changes / upgrades on the SCADA device, for example?), or somewhere else in the infrastructure. Even though the PLC stops responding after some time, I’m relatively sure the PLC is not the root cause, but rather some change in the network, in the number of devices, or in the devices’ behavior (that’s what I read out of the Wireshark trace).
- I don’t think it’s directly linked to this issue, but I still want to mention it: in the complete trace I see a very high number of ARP requests (requests for the MAC address behind an IP address) for a small set of IP addresses, much more than I would expect (97% of all captured packets are ARP requests).
- most of those requests (87.4% of all ARP requests) are sent by an HMS device (10.85.106.41). Is that the switch? Or is there a permanent network / IP scanner running, or something like that?
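In case you want to reproduce those ARP numbers yourself from an export of the trace (column layout assumed here; any per-packet protocol/sender export works), the grouping is simple:

```python
from collections import Counter

def arp_top_talkers(packets):
    """packets: (protocol, sender) tuples, e.g. from a Wireshark CSV export.
    Returns (share of ARP in the trace, Counter of ARP requests per sender)."""
    arp = [sender for proto, sender in packets if proto == "ARP"]
    share = len(arp) / len(packets) if packets else 0.0
    return share, Counter(arp)
```

Running this over the full capture should immediately show whether 10.85.106.41 alone explains the flood or whether other devices contribute as well.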
I think that’s more or less everything that can be read out of the Wireshark trace.
Best regards.