Kent J. Chen's WebLog

a personal journal by an addictive geek

Had fun hunting down a network issue at work

Posted in Information Technology on May 28th, 2009 by Kent


At work I have two NetGear GS748TS 48-port switches stacked together so that they act as one. They serve only workstations; no servers connect to either of them. Everything seemed to work fine, except that one of the units kept rebooting itself once in a while for no obvious reason, mostly during the day while people were still at work.

After checking the log file without any luck, I submitted a support ticket to NetGear. They suggested that I upgrade both switches to the latest firmware, which I did a few weeks back. The random reboots seemed to go away. However, a more serious problem emerged: people started complaining about network slowness. In the first couple of cases we fixed it simply by moving the user to a different port on the switch. But as more and more of the same complaints came in, with no obvious pattern I could see, I grew suspicious that we were missing something.

Then my assistant and I started doing some research and going through the switch settings for anything that could cause the issue. One possible solution that came up yesterday was disabling STP (Spanning Tree Protocol), because quite a few entries in the log file mentioned it. So we planned to turn the feature off last night. Guess what: the whole network went down right after I selected “disable STP” and applied the change. Big oops.

I went into the office really early this morning and tried hard to figure out what exactly had gone wrong. Luckily, I was able to hunt the monster down fairly quickly. Yes, I was lucky. Rebooting a couple of times didn't help, and neither did un-stacking the switches. Then I disconnected one of them from the main backbone, and all of a sudden everything went back to normal. OK, that's good. Next I connected my laptop to the disconnected switch and was surprised to find that I was still connected to the whole network. What the heck was going on? A switch cut off from the backbone shouldn't have given me access to anything on the network, yet it did. The only possible explanation was another connection between the two switches. Sure enough, once I assumed that, everything started to make sense: a network loop was causing all these issues. Damn, I should have realized this way back, and I should have been more careful when installing the switches in the first place.
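Tracing the loop by hand worked here, but the same check can be automated if you keep an inventory of the cables between switches. Below is a minimal, hypothetical sketch: the switch names and the cable list are made up for illustration, and nothing here comes from NetGear's tools. It uses a union-find structure: walking the cables one by one, any cable whose two ends are already connected through earlier cables closes a loop.

```python
def find_loop_links(links):
    """Return the cables that close a loop, in the order they were listed.

    links: list of (switch_a, switch_b) pairs, one per physical cable
           (a hypothetical inventory, not read from any real device).
    """
    parent = {}

    def find(node):
        # Each switch starts in its own group; follow parents to the group root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps trees shallow
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends were already reachable from each other: this cable makes a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b  # merge the two groups
    return redundant


# Hypothetical topology: each switch has an uplink, plus one forgotten cable.
links = [
    ("backbone", "switch-1"),
    ("backbone", "switch-2"),
    ("switch-1", "switch-2"),  # the stray cable between the two units
]
print(find_loop_links(links))  # [('switch-1', 'switch-2')]
```

Which cable gets flagged depends on the order of the list, since every cable in a loop is equally redundant. The output tells you that a loop exists and where, not which cable was installed by mistake.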

To recap: there was a cable connecting the two switches directly, probably installed before they were stacked. Because STP is on by default and keeps a loop from forwarding traffic endlessly, the network survived without going down completely, but we suffered the reboot issue, presumably because one of the switches occasionally got tired of the loop and decided to take a break. The new firmware seems to have made the switches cope better, so the reboots stopped, but they then appeared to spend too much time deciding how to reroute around the loop, which would explain why slowness was reported randomly by different people. Finally, when we turned STP off, nothing kept the loop in check and everything broke down. The bright side is that we got the issue solved completely.
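Why did turning STP off take everything down at once? Ethernet frames carry no hop count or TTL, and a switch floods a broadcast out of every port except the one it arrived on, so a broadcast that enters a loop never dies. The toy simulation below assumes a hypothetical three-node topology (backbone, switch-1, switch-2, with the stray cable as link 2) and simply assumes STP would block link 2. Real STP picks the blocked port by electing a root bridge and comparing path costs, which this sketch does not model.

```python
def flood(links, origin, steps, blocked=frozenset()):
    """Count broadcast copies in flight after each forwarding step.

    links:   list of (switch_a, switch_b) cables; the list index is the link id
    blocked: link ids assumed to be in STP blocking state (hypothetical choice)
    """
    active = [i for i in range(len(links)) if i not in blocked]
    # Each in-flight copy is (switch it arrives at, link it arrived on).
    in_flight = [(origin, None)]  # the original frame from a local host
    history = []
    for _ in range(steps):
        next_flight = []
        for switch, came_in_on in in_flight:
            for i in active:
                a, b = links[i]
                if i == came_in_on or switch not in (a, b):
                    continue
                # Flood out every other active port; the copy lands on the far end.
                next_flight.append((b if switch == a else a, i))
        in_flight = next_flight
        history.append(len(in_flight))
    return history


triangle = [
    ("backbone", "switch-1"),  # link 0
    ("backbone", "switch-2"),  # link 1
    ("switch-1", "switch-2"),  # link 2: the stray cable
]
print(flood(triangle, "switch-1", 6))               # STP off: [2, 2, 2, 2, 2, 2]
print(flood(triangle, "switch-1", 6, blocked={2}))  # link 2 blocked: [1, 1, 0, 0, 0, 0]
```

In this tiny triangle the count stays flat instead of exploding, but it never reaches zero, and every new broadcast from any workstation adds two more copies that circulate forever. That accumulation is what eventually saturates the links. With the stray cable blocked, the single broadcast reaches every switch once and then stops.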

So what’s the lesson learned from this exciting experience? NEVER CREATE A LOOP.


  1. Lee says:

    Great post! So true.

    If your switches are looping… MAKE THEM STOP! Hahaha.
