Modbus slaves stop listening on ports

Hi all.

I have a dedicated network for our controls devices. All switches are AB Stratix 5700 and the network has worked perfectly for 7 years. We have 88 devices on it, 29 of which are Modbus slaves. Recently I added 4 new units called Maxcess Spyder Plus tension control modules, which have a web interface and also communicate via Modbus. All have unique IPs and slave IDs, but after a few days 3 of the 4 stop listening on ports 80 & 502. Eventually, after a week or two, the last one does the same.

Rebooting them (power cycle) restores the ports instantly. They are on a machine where they connect to an 8 port hub, then on one ethernet cable to a dedicated port on a Stratix switch. No other devices on that segment / port. When restarted, they function perfectly with no errors, until they drop off the network one by one.

They still respond to pings and ARP requests (a scan with Angry IP Scanner shows when they have closed the ports / no longer listen), which is baffling. No other devices exhibit this behaviour and the OEM has been unable to shed any light on it. Physically the devices still work locally, just not over the network once they stop listening.

The last device to stop working lasts about a week, whereas the other identical units fail within a day or two. Eventually none work. Only power cycling brings them back online. Any ideas?
 
Everything has been working for years except for the new guys, all the same model. Hmmm.

Any chance this box runs on some version of Windows CE? Sure sounds to me like a 'memory leak' problem and what leaks better than windows?
 
You could be correct there, at the moment I have no idea but it's a basic little thing with a 2 line dot matrix LCD display and a bit of I/O. It's possible that CE is indeed under the hood.
 
The fact that pings and ARP still work but HTTP (port 80) and Modbus (port 502) stop working leads me to believe this is a firmware bug in the Maxcess devices. Unfortunately, intermittent issues such as this are very hard for manufacturers to debug and fix, especially if they are unable to duplicate the issue in house.

However, there is a chance it could be an issue with the 8 port hub that is connecting these devices to the Stratix switch. I have seen scenarios where a device stops communicating, and power cycling the device seems to correct the problem. However, the root cause was actually that the Ethernet switch was configured incorrectly or was defective and had stopped forwarding packets to the device. When the device was power cycled, it would send a gratuitous ARP on startup, which the switch would see and use to update its forwarding table, allowing packets to reach the device again.

I recommend trying the following test the next time this issue occurs. Take a laptop and plug an Ethernet cable directly between it and the problematic device (make sure your laptop is configured with an IP address on the same subnet as the device you're connecting to). Then try your ping, ARP, port 80, and port 502 tests.
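
For the port 80 and port 502 part of that test, a minimal Python sketch along these lines could be run from the laptop. It only checks whether a TCP connection can be completed on each port. The function names and the example address in the note below are placeholders of mine, not anything taken from the Maxcess documentation.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe(host: str, ports=(80, 502)) -> dict:
    """Check each port in turn and return {port: True/False}."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling probe("192.168.1.50") (a hypothetical address) returns a dictionary such as {80: True, 502: False}, which can be logged with a timestamp to record when each port stops accepting connections.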
 
How odd, that reply is almost word for word what the OEM said the other day. I will try with one of the units which has dropped out.

I am certain the switch is not the culprit as I have other modbus devices and Rockwell hardware on this switch, using other hardware ports. I've been doing this a very long time, and have never seen a device do this before. I've been an electronics and automation engineer for 35 years. It's a new one on me.

Thanks for the reply though, it is a good response and is the sensible line of course. I did wonder about the TPLink hub (unmanaged desktop switch) but again I use these in other machines and have never had a modbus slave drop out when connected.

I'll try some more testing as suggested and see what happens. I know the configuration in these Stratix switches inside out. I also have dozens of them across the plant, I configured every one of them originally.
 
I have had problems with consumer grade switches needing periodic power resets, which I put up with for a short while and then replace. Of course, one never discovers the root cause of a problem with a disposable consumer product.

Please tell us what you find. I'm really curious.
 
I will let you know. I should say I have tried rebooting the TPLink switch and it made no difference. The Rockwell Stratix 5700 on the other hand, is no consumer grade item. It's a hardened, industrial, gigabit fully managed switch at £3000 each, running a Cisco OS. These are not toys.
 
The details you've provided (adding now that rebooting the TPLink switch does not resolve the issue) continue to point to a firmware issue with the Maxcess devices. I do, however, still recommend performing the test I mentioned in my last post to confirm that, even when directly connected to a laptop, the Maxcess devices are unresponsive on HTTP and Modbus TCP when this issue occurs.

If your testing does confirm that the issue is isolated to the Maxcess devices themselves, indicating that they have somehow gotten into a bad state, then I suggest trying the following. Take a look at the configuration on your Stratix switch for the port that the TPLink switch is connected to (and through which all the Maxcess devices are connected). Disable any options that would cause the Stratix switch to inject any metadata (such as VLAN tagging) into the Ethernet frames as it forwards them. The thought process behind this is that some sort of unexpected metadata within the Ethernet frames may be causing the Maxcess devices to go off into a bad state.
 
That port as with all automation devices in the plant controls network is tagged as vlan 600. We have 5 vlans for various requirements. Could this cause them a problem?
 
Having the devices on a VLAN, in general, is not problematic. The issue is the metadata (e.g. encapsulation) added to the Ethernet frames by the switch. If the Stratix's VLAN port that is connected to the TPLink switch is configured as "tagged" (a.k.a. a "trunked port"), then the switch will inject VLAN tags into the Ethernet frame, which could be contributing to the Maxcess devices going into a bad state. You should try setting the port as "untagged" (a.k.a. an "access port") so that the switch does not add VLAN tags to the Ethernet frames.

Note that there may be other settings in the Stratix configuration, or the specific port's configuration, besides VLAN tagging that also cause the switch to encapsulate additional information into the Ethernet frame. You would need to investigate this and try to ensure that the Stratix switch is not adding any additional metadata to the Ethernet frame.
 
Ah I see what you mean. No of course all physical ports with devices connected are set as access ports. Only the two ports which are coming from, and leading to, another managed switch in the chain, are set as Trunk. All others are "device for automation" in Rockwell terms..access ports.
 
So far I've got these units on a TPLink switch (unmanaged), which then goes to a touch panel PC / tablet stashed in the control cabinet. The tablet is attached to the hub, and its WiFi is connected to my controls network so I can remote into it. So far I'm pulling Modbus data from one of the devices and making a TCP connection to all 4 on port 80, using an app I wrote in C# where I can log what happens.

When they were going to the Stratix switch, they dropped off pretty soon, apart from the last one standing. It stays working mostly (drops out but comes back in again).

Now they are only going to the tablet PC, and so far there have been no dropouts after 36 hours. Something about the network affects them when hooked up to the switch, so it could be the VLAN tags or something else they don't like. Time and further testing will tell.

Next if they survive the weekend intact, I'll bridge the ethernet to the WiFi and see what happens.
 
OK, so latest thoughts, see what you guys think: all ports on this switch are set to the SmartPort setting "Multiport Device for Automation", as per Rockwell documentation when using a device with multiple software ports or when connected to an unmanaged switch. I wonder if it's better to select "None". I am about to try this, as they have not dropped out for a long time when simply connected to a laptop via an unmanaged switch.
 
I'm not familiar with the configuration settings on that switch, but whatever configuration you can do with the Stratix switch to make it operate as close to a dumb, unmanaged switch as possible for that port will probably be your best bet in getting the Maxcess devices to behave.
 
Yep, I have tried every version which is plain but unfortunately it defaults to VLAN 1 for anything non-automation (which is no use, the Modbus master is a ProSoft card in a PLC rack so it loses all connection with the slaves if they aren't on VLAN 600). I can't select "None" and keep them on VLAN 600, the switch throws an error and refuses to save the setting.

I do see the ProSoft card has two options for MBAP port over-ride, but not sure if that would affect them. In case it isn't the switch.

Going to try shutting off the ProSoft link to them for a few days and read direct from a client and see what happens, so still using the switch but not the ProSoft. If that runs ok, then I need to speak to PS and see what they advise.
 
Latest info, during testing I see the ProSoft card has created several (7) concurrent TCP connections to the Modbus slave and this is causing the lockups. Odd why it doesn't close the connections each time it polls for data. Weird, do you think this is a slave or master fault?
 
How did you determine the ProSoft card created several concurrent TCP connections? Did you capture the Ethernet packets using Wireshark?

Similarly, how did you determine this is what's causing the lockups? Assuming you took a capture, does the Wireshark capture show the Maxcess devices rejecting new TCP connections?

If you would like us to review your capture, please attach it.

Regarding the behavior you're seeing, it is actually preferred for a Modbus/TCP client to keep a connection open, as this is more efficient. Some Modbus testing tools open and close a TCP connection for each and every request, which is very inefficient, but since it's just a test tool, this doesn't really matter.
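
To illustrate what "keeping a connection open" looks like from the client side, here is a minimal Python sketch, assuming a server that answers function code 3 in the standard Modbus/TCP framing. The class and method names are made up for the example, and this is not how the ProSoft card is implemented. It opens one TCP connection on first use and reuses it for every poll, only dropping it after an error.

```python
import socket
import struct


class PersistentModbusClient:
    """Hypothetical client that reuses one TCP connection for every poll."""

    def __init__(self, host, port=502, timeout=2.0):
        self.host, self.port, self.timeout = host, port, timeout
        self._sock = None
        self._tid = 0

    def _connect(self):
        # Open the connection lazily and keep it for later requests.
        if self._sock is None:
            self._sock = socket.create_connection(
                (self.host, self.port), timeout=self.timeout)
        return self._sock

    def close(self):
        if self._sock is not None:
            self._sock.close()
            self._sock = None

    @staticmethod
    def _recv_exact(sock, n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed the connection")
            buf += chunk
        return buf

    def read_holding_registers(self, unit_id, start, count):
        self._tid = (self._tid + 1) & 0xFFFF
        # MBAP header (transaction, protocol 0, length 6, unit) + FC 3 PDU.
        request = struct.pack(">HHHBBHH", self._tid, 0, 6, unit_id, 3, start, count)
        try:
            sock = self._connect()
            sock.sendall(request)
            header = self._recv_exact(sock, 7)
            _tid, _proto, length, _unit = struct.unpack(">HHHB", header)
            pdu = self._recv_exact(sock, length - 1)
        except OSError:
            self.close()  # drop the broken socket; the next poll reconnects
            raise
        if pdu[0] != 3:
            raise ValueError("exception response, function byte 0x%02x" % pdu[0])
        byte_count = pdu[1]
        return list(struct.unpack(">%dH" % (byte_count // 2), pdu[2:2 + byte_count]))
```

With a client like this, a server should only ever see one established connection per client, however many polls are made.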

Similarly, a Modbus/TCP client may open several concurrent TCP connections for efficiency reasons. However, doing this to a single server seems unnecessary (though it could be a result of how the device's Modbus driver is implemented). The client (master) is responsible for initiating and managing the TCP connection, but if the client abandons the connection (e.g. it goes offline suddenly, the Ethernet cable is disconnected, etc.), the server (slave) is supposed to automatically close the connection after a (vendor-specific) timeout period and recycle the socket, allowing new connections to be made.

It seems that there may be two issues happening here, one in the ProSoft and another in the Maxcess. The ProSoft shouldn't need several connections to a single server. There may be a setting (or it could be due to the way the register definitions are configured) in order to prevent this from happening. Even so, the ProSoft should never be abandoning the TCP connections. It should either continue using each and every connection or gracefully close them.

On the Maxcess side, these devices should be closing any abandoned TCP connections and recycling the sockets, allowing new connections to be made. This is supposed to happen after some timeout period, which is vendor-specific, so it's hard to say exactly how long this would take, but one would hope it to be a matter of minutes.
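
As an illustration only of what "closing abandoned connections and recycling the sockets" means on the server side, here is a small Python sketch. I have no knowledge of how the Maxcess firmware is written, so the function names, the timeout value and the echo reply are all placeholders. Each accepted connection gets a receive timeout, and a connection that stays silent for longer than that is closed so its socket is freed for new clients.

```python
import socket
import threading


def handle_connection(conn: socket.socket, idle_timeout: float) -> None:
    """Serve one client; close the connection if it is idle for idle_timeout seconds."""
    conn.settimeout(idle_timeout)
    with conn:
        while True:
            try:
                data = conn.recv(260)  # a Modbus/TCP frame is at most 260 bytes
            except socket.timeout:
                return  # idle too long: treat as abandoned and free the socket
            except OSError:
                return  # connection reset or similar
            if not data:
                return  # client closed gracefully
            conn.sendall(data)  # placeholder reply (echo), not a Modbus response


def serve_with_idle_timeout(listener: socket.socket, idle_timeout: float) -> None:
    """Accept clients on an already-listening socket, one thread per connection."""
    while True:
        try:
            conn, _addr = listener.accept()
        except OSError:
            return  # listener was closed
        threading.Thread(
            target=handle_connection, args=(conn, idle_timeout), daemon=True
        ).start()
```

A device with only a handful of sockets and no such timeout would eventually run out of them if a client keeps opening connections it never closes, which would look like the ports no longer listening.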
 
The info came from telnetting into the Maxcess device; using the firmware's own commands I can see all connections and sockets (they gave me the command list the other day). I didn't use Wireshark: I can't create a mirror port on the local switch the devices are on (none left free), and the laptop in the office can't see traffic between two host devices on the automation network, of course. Other devices I have show the same multiple TCP connections on port 502 from the ProSoft cards but seem happy enough and never act up. Maxcess have confirmed their device can't cope with multiple TCP connections on the same port, so they are looking into this, as are ProSoft.

The client register definitions are simple. IP Address, Slave ID, MB function, starting register and register count for each slave device.

The TCP connection status is ESTAB/0 so not abandoned. I think this is how the ProSoft card operates with MBAP devices and the Maxcess slave just doesn't like it. No idea yet why the card does this with some slave brands and not with others, yet the client definitions are the same format in the PS card for all devices (except IP, slave ID, register start and count etc of course lol)

The only thing I did differently with these devices is use MB function FC 3 instead of FC 4 so should not matter, read holding register instead of read input registers should behave exactly the same in a modern MB slave device. Even the Maxcess manual says use FC 3 or FC 4 to read data.
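
For reference, a quick Python sketch of the two request frames, assuming the standard Modbus/TCP layout (7-byte MBAP header followed by the PDU). The function name and the example values are mine. It shows that an FC 3 and an FC 4 read request for the same registers differ in a single byte on the wire; whether a given slave handles the two identically is a separate question that the sketch cannot answer.

```python
import struct

READ_HOLDING_REGISTERS = 3
READ_INPUT_REGISTERS = 4


def build_read_request(transaction_id, unit_id, function_code, start, count):
    """Build a Modbus/TCP read request: MBAP header plus a 5-byte PDU."""
    # Length counts the bytes after the length field:
    # unit id (1) + function code (1) + start address (2) + register count (2).
    length = 6
    return struct.pack(
        ">HHHBBHH", transaction_id, 0, length, unit_id, function_code, start, count
    )


fc3 = build_read_request(1, 1, READ_HOLDING_REGISTERS, 0, 2)
fc4 = build_read_request(1, 1, READ_INPUT_REGISTERS, 0, 2)

# Positions (0-based) at which the two frames differ.
differing = [i for i, (a, b) in enumerate(zip(fc3, fc4)) if a != b]

print(fc3.hex())   # 000100000006010300000002
print(fc4.hex())   # 000100000006010400000002
print(differing)   # [7]
```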

Some of my many MB devices show only one TCP connect on 502 but some do show multiple (talking here about all the slaves which work perfectly and not the Maxcess Spyders) - I note the ones with single connections are embedded HMS AnyBus chips and the ones which show multiple connections to the PS card are on the other side of HMS AnyBus Gateway modules, I guess HMS have the market cornered in MB to ETH comms as these are all different OEMs equipment.
 
FWIW, I don't know that I'd necessarily trust the information provided by the Maxcess devices, since they're the devices having issues, possibly due to a firmware bug. It is not unreasonable to assume that information provided by these devices is untrustworthy.

It would be best to verify this by setting up a switch with port mirroring (or use a dedicated Ethernet tap device) and capturing the packets using Wireshark. It's likely that Maxcess and ProSoft are going to request this anyway to assist them with troubleshooting this issue.
 
I will do. Another fly in the ointment seems to be the combination of ProSoft and a Phoenix FL BT EPA bridge in the machine, used on 3 of the 4 Maxcess units because those areas of the machine cannot have cables run to them due to moving parts. Power is over slip rings and comms is wireless. On the BT-enabled sections, many TCP connections are made by the PS card to the Maxcess unit and never broken. On the one with no BT, only one connection is made. On all of them, if I use a non-ProSoft MB client, they behave and only have one connection.
 