
Troubleshoot issues related to connectivity, zoning, or performance

• Have there been any recent changes to the switch configuration? (Zoning, adding ports, adding devices, relocation, or replacement of switches)
  Note: Recent changes are any changes made within 3-7 days of the first occurrence of the issue.



• Is the issue related to connectivity? Are all connected ports showing a valid link-established LED? (i.e., controller host ports, host HBA ports, and switch ports)
  o If so, is the issue persistent (is the connection down now), or is it intermittent?
  o How long has the issue been occurring?
    Note: Exact timestamps are the most valuable information when investigating any SAN issue. If a timestamp cannot be provided, press for the exact day of the initial occurrence.
  o Was the connection ever functional? If so, when did the connection begin to experience issues?
  o Has any hardware or component been replaced?

• Is the issue related to configuration? If so, what has been done to remedy the issue?
  Note: Preliminary troubleshooting steps or actions taken against the fabric device are key pieces of information when trying to get a clear picture of an issue.

• Is the issue related to performance? If so:
  o Are all ports affected, or is the issue limited to a select few?
  o What was the previous performance, and how has it changed?
    o What are the expected performance results?
    o How is the performance being measured?

Log Collection


Brocade fabric switches can be accessed primarily in two ways:
• Web GUI: enter the switch's management IP address into a web browser.
• Direct command-line interface (CLI): the user executes commands directly against Brocade's Fabric Operating System (FOS).
Note: The default usernames are commonly 'admin' or 'user', and the default password is 'password'.
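
A minimal sketch of opening a CLI session over SSH from a management host, assuming the default admin account (the IP address and prompt name are illustrative):

Example:
ssh admin@10.10.10.100
admin@10.10.10.100's password:
myswitch:admin>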

There are two types of data collection associated with Brocade switches: a supportshow and a supportsave.
Note: If the customer's issue falls outside the scope of a port issue or basic functionality, have them collect a supportsave instead of a supportshow.

• If the issue is related to basic errors such as port issues and connectivity, a supportshow may be sufficient. However, when a more complex issue involves zoning, performance, reboots, failovers, or symptoms that span the entire fabric, a supportsave is required.

• If the issue spans an environment with multiple switches (ISL connections and long-distance configurations), a supportsave from all relevant switches in the fabric is required. To be completely covered, always request a supportsave when dealing with Brocade.
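
Since a supportshow is simply printed to the terminal, it has to be captured to a file. A minimal sketch from a Linux management host, assuming a FOS version that allows remote command execution over SSH (the address and file name are illustrative); alternatively, enable session logging in your terminal program before running supportshow interactively:

Example:
ssh admin@10.10.10.100 supportshow > myswitch-supportshow.txt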

SupportSave Data Collection
When the issue is more sophisticated, a supportsave from the switch is required. The supportsave command is available as of Fabric OS version 4.4; however, Fabric OS versions later than 6.2.x provide a significantly better collection of logs, which represent the status overview of the switch and fabric. If you have a director-class switch with two CPs and/or core plus function blades, it will also collect information from all the blades.
The supportsave uploads between 25 and 80 files, depending on the platform, Fabric OS level, and enabled features, to an FTP or SCP server. These will not be tarred or zipped into one file, so it is important that you create such an archive with a meaningful name (switchname-domainid-fabricid.zip).
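
A minimal sketch of creating that archive on the FTP/SCP server afterwards, assuming the files landed in the /directory target used in the example below (the archive name follows the convention above, with illustrative values):

Example:
cd /directory
zip -r myswitch-143-128.zip .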

Example:
Fabosv4.4switch:admin> supportsave -u anonymous -p password -h
xxx.xxx.xxx.xxx -d /directory -l ftp
This command collects RASLOG, TRACE, supportShow, core file, and FFDC data, and then transfers them to an FTP/SCP server or a USB device. This operation can take several minutes.
Note: supportsave transfers the existing trace dump file first, and then automatically generates and transfers the latest one, so two trace dump files are transferred after running the command:
OK to proceed? (yes, y, no, n): [no] y
Saving support information for switch:BR4100_IP127, module:RAS...
Saving support information for switch:BR4100_IP127, module:CTRACE_OLD...
Saving support information for switch:BR4100_IP127, module:CTRACE_NEW...

Performance related issues
If there is no sign of an obvious physical issue, there might be link-related problems, which can manifest as performance issues and/or protocol-related errors. Brocade counters are cumulative and remain so until a counter wraps, the switch reboots, or the statistics are manually cleared.
In these circumstances, Support requires that a new baseline is created, a certain run-time is achieved, and separate commands are submitted against the suspected switch or switches.
Note: By clearing the error counters, future log collections can be compared to the original to determine deltas and narrow down issues.
To create a new baseline with cleared counters, perform the following steps:
1. Log in to the switch using Telnet or SSH.
2. Run the statsclear command.
3. Run the slotstatsclear command.
After the agreed run-time (usually around one hour), capture the following output using your terminal program:


• porterrshow
• slotstatsshow -c 5 -p 60
• sloterrshow -c 5 -v -p 60

This gives one interval of front-end port errors, five intervals of overall traffic statistics, and five intervals of error statistics over a five-minute time span. Collect the output into a file and upload it together with the supportsave under the respective case ID.
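
Put together, a baseline run looks roughly like this (a sketch; the prompt name and run-time are illustrative):

Example:
myswitch:admin> statsclear
myswitch:admin> slotstatsclear
(wait for the agreed run-time, typically about one hour)
myswitch:admin> porterrshow
myswitch:admin> slotstatsshow -c 5 -p 60
myswitch:admin> sloterrshow -c 5 -v -p 60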

5-minute initial troubleshooting on Brocade

Switchstatusshow
Explanation
Provides an overview of the general components of the switch. These all need to show up HEALTHY and not (as shown here) as "MARGINAL".
Example
Sydney_ILAB_DCX-4S_LS128:FID128:admin> switchstatusshow
Switch Health Report
Report time: 06/20/2013 06:19:17 AM
Switch Name:  Sydney_ILAB_DCX-4S_LS128
IP address:   10.129.2.143
SwitchState:  MARGINAL
Duration:     214:29
Power supplies monitor  MARGINAL
Temperatures monitor    HEALTHY
Fans monitor            HEALTHY
WWN servers monitor     HEALTHY
CP monitor              HEALTHY
Blades monitor          HEALTHY
Core Blades monitor     HEALTHY
Flash monitor           HEALTHY
Marginal ports monitor  HEALTHY
Faulty ports monitor    HEALTHY
Missing SFPs monitor    HEALTHY
Error ports monitor     HEALTHY
All ports are healthy

Switchshow
Explanation
Provides a general overview of the logical switch status (no physical components) plus a list of ports and their status.
The switchState should always be Online.
The switchDomain should have a unique ID in the fabric.
If zoning is configured, it should be in the "ON" state.
Connected and operational ports should all show "Online". If you see ports showing "No_Sync" where the port is not disabled, there is likely a cable or SFP/HBA problem.
If you have configured Fabric Watch to enable port fencing, you will see indications like the one on port 75 below.
Obviously, for any port to work it must be enabled.
Example
Sydney_ILAB_DCX-4S_LS128:FID128:admin> switchshow
switchName:     Sydney_ILAB_DCX-4S_LS128
switchType:     77.3
switchState:    Online
switchMode:     Native
switchRole:     Principal
switchDomain:   143
switchId:       fffc8f
switchWwn:      10:00:00:05:1e:52:af:00
zoning:         ON (Brocade)
switchBeacon:   OFF
FC Router:      OFF
Fabric Name:    FID 128
Allow XISL Use: OFF
LS Attributes:  [FID: 128, Base Switch: No, Default Switch: Yes, Address Mode 0]

Index Slot Port Address Media Speed State    Proto
============================================================
   0    1    0  8f0000   id    4G   Online   FC  E-Port 10:00:00:05:1e:36:02:bc "BR48000_1_IP146" (downstream)(Trunk master)
   1    1    1  8f0100   id    N8   Online   FC  F-Port 50:06:0e:80:06:cf:28:59
   2    1    2  8f0200   id    N8   Online   FC  F-Port 50:06:0e:80:06:cf:28:79
   3    1    3  8f0300   id    N8   Online   FC  F-Port 50:06:0e:80:06:cf:28:39
   4    1    4  8f0400   id    4G   No_Sync  FC  Disabled (Persistent)
   5    1    5  8f0500   id    N2   Online   FC  F-Port 50:06:0e:80:14:39:3c:15
   6    1    6  8f0600   id    4G   No_Sync  FC  Disabled (Persistent)
   7    1    7  8f0700   id    4G   No_Sync  FC  Disabled (Persistent)
   8    1    8  8f0800   id    N8   Online   FC  F-Port 50:06:0e:80:13:27:36:30
  75    2   11  8f4b00   id    N8   No_Sync  FC  Disabled (FOP Port State Change threshold exceeded)
  76    2   12  8f4c00   id    N4   No_Light FC  Disabled (Persistent)
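
Ports marked "Disabled (Persistent)" stay down across reboots until they are explicitly re-enabled. A minimal sketch of re-enabling such a port (the slot/port is illustrative; check the command reference for your FOS release):

Example:
myswitch:admin> portcfgpersistentenable 1/4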

sfpshow <slot>/<port>

Explanation

One of the most important pieces of a link, irrespective of mode and distance, is the SFP. On newer hardware and software this command provides a lot of information on the overall health of the link.
With older FOS code there could be a discrepancy between what was displayed in this output and what was actually plugged into the port. The reason is that the SFPs are only polled every now and then for status and update information. If a port was persistently disabled, the data was not updated at all, so in theory you could plug in another SFP but sfpshow would still display the old information. With FOS 7.0.1 and up this has been corrected, and you can now also see the latest polling time per SFP.
The question we often get is: "What should these values be?" The answer is "It depends". As you can imagine, a shortwave 4G SFP requires less current than a longwave 100 km SFP, so in essence the SFP specifications should be consulted. As a rule of thumb, signal quality depends on the TX power value minus the link-loss budget; the result should be within the RX power specifications of the receiving SFP.
Also check the Current and Voltage of the SFP. A broken SFP often draws no power at all, and you will see these two values drop to zero.
Example
Sydney_ILAB_DCX-4S_LS128:FID128:admin> sfpshow 1/1
Identifier:  3    SFP
Connector:   7    LC
Transceiver: 540c404000000000 2,4,8_Gbps M5,M6 sw Short_dist
Encoding:    1    8B10B
Baud Rate:   85   (units 100 megabaud)
Length 9u:   0    (units km)
Length 9u:   0    (units 100 meters)
Length 50u (OM2): 5 (units 10 meters)
Length 50u (OM3): 0 (units 10 meters)
Length 62.5u: 2   (units 10 meters)
Length Cu:   0    (units 1 meter)
Vendor Name: BROCADE
Vendor OUI:  00:05:1e
Vendor PN:   57-1000012-01
Vendor Rev:  A
Wavelength:  850  (units nm)
Options:     003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max:      0
BR Min:      0
Serial No:   UAF110480000NYP
Date Code:   101125
DD Type:     0x68
Enh Options: 0xfa
Status/Ctrl: 0x80
Alarm flags[0,1] = 0x5, 0x0
Warn Flags[0,1]  = 0x5, 0x0
                                          Alarm                    Warn
                                      low        high          low        high
Temperature: 25      Centigrade       -10        90            -5         85
Current:     6.322   mAmps            1.000      17.000        2.000      14.000
Voltage:     3290.2  mVolts           2900.0     3700.0        3000.0     3600.0
RX Power:    -3.2 dBm (476.2uW)       10.0 uW    1258.9 uW     15.8 uW    1000.0 uW
TX Power:    -3.3 dBm (472.9 uW)      125.9 uW   631.0 uW      158.5 uW   562.3 uW
State transitions: 1
Last poll time: 06-20-2013 EST Thu 06:48:28
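
To make the rule of thumb concrete (illustrative numbers): the SFP above transmits at -3.3 dBm. Across a link with, say, a 4 dB loss budget, the far end receives roughly -3.3 - 4 = -7.3 dBm, which sits comfortably inside an RX warning window like the one shown (15.8 uW to 1000.0 uW, i.e. roughly -18 dBm to 0 dBm).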

Porterrshow
For link-state counters this is the most useful command on the switch. There is a perception that this command provides a silver bullet to solve port and link issues, but that is not the case. Basically, it provides a snapshot of the content of the LESB (Link Error Status Block) of each port at that particular point in time. It does not tell us when these counters accumulated or over which time frame. So in order to create a sensible picture of the status of the ports, we need a baseline. This baseline is created by resetting all counters and starting from zero; to do this, issue the statsclear command on the CLI.
There are seven columns you should pay attention to from a physical perspective.

enc_in - Encoding errors inside frames. These are errors that happen at the FC1 layer, with 8b/10b encoding (or, with 10G and 16G FC, 64b/66b encoding). Since these happen on bits that are part of a data frame, they are counted in this column.

crc_err - An enc_in error might lead to a CRC error; however, this column shows frames that were marked as invalid because of a CRC error earlier in the data path. According to the FC specifications it is up to the implementation whether to discard the frame right away or mark it as invalid and send it to the destination anyway; there are pros and cons to both scenarios. So basically, if you see crc_err incrementing in this column, it means the port has received a frame with an incorrect CRC, but the corruption occurred further upstream.

crc_g_eof - This column is the same as crc_err, except the incoming frames are NOT marked as invalid. If you see these, most often the enc_in counter increases as well, but not necessarily. If the enc_in and/or enc_out columns increase as well, there is a physical link issue which could be resolved by cleaning connectors, replacing a cable, or (in rare cases) replacing the SFP and/or HBA. If the enc_in and enc_out columns do NOT increase, there is an issue between the SERDES chip and the SFP which causes the CRC to mismatch the frame. This is a firmware issue which could be resolved by upgrading to the latest FOS code; there are a couple of defects listed to track these.

enc_out - Similar to enc_in, this is the same encoding error, except it occurred outside normal frame boundaries, i.e. no host I/O frame was impacted. This may seem harmless; however, be aware that a lot of primitive signals and sequences travel in between normal data frames which are paramount for Fibre Channel operations, especially the primitives which regulate credit flow (R_RDY and VC_RDY) and clock synchronization. If this column increases on any port, you will likely run into performance problems sooner or later, or you will see a problem with link stability and sync errors (see below).


Link_Fail - This means a port has received a NOS (Not Operational) primitive from the remote side and needs to change the port's operational state to LF1 (Link Failure 1), after which the recovery sequence commences. (See the FC-FS standards specification for details.)

Loss_Sync - Loss of synchronization. The transmitter and receiver side of the link maintain clock synchronization based on primitive signals which start with a certain bit pattern (K28.5). If the receiver is not able to sync its baud rate to the point where it can distinguish between these primitives, it will lose sync and hence cannot determine when a data frame starts.

Loss_Sig - Loss of signal. This column shows a drop of light, i.e. no light (or insufficient RX power) is observed for over 100 ms, after which the port goes into a non-active state. This counter often increases when the link-loss budget is overdrawn. If, for instance, the TX side sends out light at -4 dBm and the receiver's lower sensitivity threshold is -12 dBm, and the quality of the cable deteriorates the signal to a value below that threshold, you will see the port bounce very often and this counter increase. Another culprit is often unclean connectors, patch panels, and badly made fibre splices. These ports should be shut down immediately and the cabling plant checked. Replacing cables and/or bypassing patch panels is often a quick way to find out where the problem is.

The other columns are more related to protocol issues and/or performance problems, which could be the result of a physical problem but not a cause. In short, look at the seven columns mentioned above and check that no port is increasing a value.
============================================
too_short/too_long - Indicates a protocol error where SOF or EOF are observed too soon or too late. These two columns rarely increase.

bad_eof - Bad End-of-Frame. This column indicates an issue where the sender has observed an abnormality in a frame or its transceiver while the frame header and portions of the payload were already sent to the destination. The only way for the transceiver to notify the destination is to invalidate the frame: it truncates the frame and adds an EOFni or EOFa to the end. This signals the destination that the frame is corrupt and should be discarded.

F_Rjt and F_Bsy are often seen in FICON environments where control frames could not be processed in time or are rejected based on fabric configuration or fabric status.

c3timeout (tx/rx) - These counters indicate that a port is not able to forward a frame in time to its destination. They show either a problem downstream of this port (tx) or a problem on this port where it has received a frame meant to be forwarded to another port inside the same switch (rx). Frames are ALWAYS discarded at the RX side (since that is where the buffers hold the frame); the tx column is an aggregate of all RX ports that need to send frames via this port according to the routing tables created by FSPF.

pcs_err - Physical Coding Sublayer errors. These values represent encoding errors on 16G platforms and above. Since 16G speeds changed to 64b/66b encoding/decoding, there is a separate control structure that takes care of this.

As a best practice, it is wise to keep a trace of these port errors and create a new baseline every week. This allows you to quickly identify errors and solve them before they become a problem with an elongated resolution time. Make sure you do this fabric-wide to maintain consistency across all switches in that fabric.
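
A minimal sketch of such a weekly routine from a management host, assuming SSH access and a FOS version that allows remote command execution (switch names and credentials are placeholders):

Example:
#!/bin/sh
# For each switch in the fabric: save a dated porterrshow snapshot,
# then clear the counters to start a fresh baseline.
for sw in switch1 switch2; do
  ssh admin@"$sw" porterrshow > "$sw-porterrshow-$(date +%Y%m%d).txt"
  ssh admin@"$sw" statsclear
done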

Sydney_ILAB_DCX-4S_LS128:FID128:admin> porterrshow
          frames      enc   crc   crc    too   too   bad   enc  disc  link  loss  loss  frjt  fbsy  c3timeout  pcs
       tx      rx     in    err   g_eof  shrt  long  eof   out  c3    fail  sync  sig                tx    rx  err
  0:  100.1m  53.4m    0     0     0      0     0     0     0    0     0     0     0     0     0     0     0    0
  1:  466.6k 154.5k    0     0     0      0     0     0     0    0     0     0     0     0     0     0     0    0
  2:  476.9k 973.7k    0     0     0      0     0     0     0    0     0     0     0     0     0     0     0    0
  3:  474.2k 155.0k    0     0     0      0     0     0     0    0     0     0     0     0     0     0     0    0

The following commands will also help to troubleshoot whether the port is logging in to the fabric:
1. nsshow/nscamshow
2. nodefind
3. fcping - to check the zoning
4. perfmonitorshow
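
A quick sketch of the zoning check with fcping, assuming two device WWNs that should be zoned together (the WWNs are illustrative; the command verifies zoning between the pair and pings each device):

Example:
myswitch:admin> fcping 50:06:0e:80:06:cf:28:59 10:00:00:05:1e:7a:3c:00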


Brocade bottleneckmon 101
Bottleneckmon
If a performance problem is escalated to technical support, the next thing that will most probably happen is that the support engineer asks you to clear the counters, wait up to three hours while the problem is noticeable, and then gather a supportsave from each switch in both fabrics.
Why 3 hours?
A manual performance analysis is based on certain 32-bit counters in a supportsave. In a device that is able to route I/O at several gigabits per second, 32 bits are not a huge range for counters, and they will eventually wrap if you wait too long. A wrapped counter is worthless, because you cannot tell if and how often it wrapped, so all comparisons would be meaningless.
Besides the wait time, the whole handling of the data collections, including gathering them and uploading them to support, takes precious time. And then support has to process and analyze them. After all these hours of continuously repeated telephone calls from management and internal and/or external customers, the support engineer has hopefully found the cause of your performance problem, and most probably it is not even the fault of a switch. If they point you to a slow-drain device, you would then start to involve the admins and/or support for that particular device.
You definitely need a shortcut!
And this shortcut is the bottleneckmon. It is made to permanently check your SAN for performance problems. Configured correctly, it will pinpoint the cause of performance problems, at least the bigger ones. The bottleneckmon was introduced with Fabric OS v6.3.x with some major limitations, but from v6.4.x it eventually became a must-have by offering two useful features:

Congestion bottleneck detection
This just measures the link utilization. With the Fabric Watch license (pre-loaded on many of the IBM-branded switches and directors) you have already been able to do that for a long time, but the bottleneckmon offers a bit more convenience and puts it in the proper context. The more important feature is:


Latency bottleneck detection
This feature shows you most medium-to-major situations of buffer credit starvation. If a port runs out of buffer credits, it is not allowed to send frames over the fibre. To make a long story short: if you see a latency bottleneck reported against an F-Port, you have most probably found a slow-drain device in your SAN. If it is reported against an ISL, there are two possible reasons:
1. There could be a slow-drain device "down the road": the slow-drain device could be connected to the adjacent switch or to another switch connected to it. Credit starvation typically back-pressures to affect wide areas of the fabric.
2. The ISL could have too few buffers. Maybe the link is just too long. Or the average frame size is much smaller than expected. Or QoS is configured on the link but you do not have QoS zones prioritizing your I/O, which can have a huge negative impact. Another reason could be a misconfigured long-distance ISL.
Whatever it is, it is either the reason for your performance problem or at least contributing to it, and it should definitely be solved.

With Fabric OS v7.0 the bottleneckmon was improved again. While the core policy which detects credit starvation situations was pretty much pre-defined before v7.0, you are now able to configure it in the minutest detail. We are still testing that out in more detail; for the moment I recommend using the defaults.
So how to use it?

First: I highly recommend updating your switches to the latest supported v6.4.x code if possible; it is much better there than in v6.3! If you look up bottleneckmon in the command reference, it offers plenty of parameters and sub-commands. But in fact, for most environments and performance problems, it is enough to just enable it and activate the alerting:

myswitch:admin> bottleneckmon --enable --alert

That's it. It will generate messages in your switch's error log if a congestion or a latency bottleneck is found. Pretty straightforward. If you are not sure, you can check the status with:

myswitch:admin> bottleneckmon --status

And of course there is a show command which can be used with various filter options, but the easiest way is to just wait for the messages in the error log. They will tell you the type of bottleneck and, of course, the affected port.
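
For an on-demand view, a minimal sketch of the show sub-command (exact filter options vary by FOS release, so check the command reference; the slot/port is illustrative):

Example:
myswitch:admin> bottleneckmon --show 2/11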

And if there are messages now?
Well, there is still the chance that there are actually situations of buffer credit starvation that the default-configured bottleneckmon cannot see.
