Weird dropping of packets
There's something that just baffles me right now and I'm out of ideas. Would
Basically, I have a Dell PowerEdge R415 rack server running Xen 4.1 with a Debian
squeeze dom0. On two occasions (and now is the second one), I see weird behaviours
I've got a ssh connection open from my workstation to the server and it works.
I cannot ping it nor establish new TCP connections. I can see the packets go out of my
workstation interface, the switch claims it forwards them to the server, but I cannot see
them on the server with tcpdump, nor do the interface statistics increase. I can
same from other workstations while others are OK.
What really baffles me is that there is an established and working ssh connection.
Initially, I was seeing the "dropped" statistics increase, and ethtool -S eth0 on the
server showed some rx_fw_discard, but after increasing the rx ring buffer that went away,
but still the same problem.
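
For anyone wanting to watch the same counters, here is a minimal sketch. The helper names (read_counter, counter_delta) and the SYSFS_NET variable are made up for illustration; the sysfs statistics files and the ethtool -g / -G / -S options are standard, but the ring sizes your NIC accepts will differ.

```shell
#!/bin/sh
# Hypothetical helpers: report how much an interface counter grew over an interval.
# SYSFS_NET defaults to the real sysfs path and can be overridden for testing.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

read_counter() {
    # $1 = interface, $2 = counter name (e.g. rx_dropped, rx_packets)
    cat "$SYSFS_NET/$1/statistics/$2"
}

counter_delta() {
    # $1 = interface, $2 = counter name, $3 = seconds to wait between reads
    before=$(read_counter "$1" "$2") || return 1
    sleep "$3"
    after=$(read_counter "$1" "$2") || return 1
    echo $((after - before))
}

# Examples, commented out so that sourcing this file has no side effects:
# counter_delta eth0 rx_dropped 10      # growth of rx_dropped over 10 seconds
# ethtool -S eth0 | grep -i discard     # driver-level discard counters
# ethtool -g eth0                       # current and maximum ring sizes
# ethtool -G eth0 rx 4080               # needs root; 4080 is the value used later in this thread
```

A delta that stays at zero for rx_packets while another host is pinging the box would match the symptom described above.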
There's a bridge br0 with eth0 and the virtual interfaces for the Xen domUs,
looks fine there.
That server has a BMC with a net interface with a different MAC address. I can ping
BMC from my workstation, but not from the server. That BMC shares the same physical
network connection (I'm not sure how that works, if there's an internal bridge in
server, could it be where the problem lies?)
That's a Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet
# ethtool -i eth0
firmware-version: 5.2.3 NCSI 2.0.11
From dmesg, the link went down a few times. I think the problem started to occur
NETDEV WATCHDOG: eth0 (bnx2): transmit queue 7 timed out
appeared in dmesg.
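
To see how often the transmit watchdog fired, something like the following might help. It is only a sketch: count_tx_timeouts is a made-up helper name, and it matches the message format quoted above, which may differ between kernel versions.

```shell
#!/bin/sh
count_tx_timeouts() {
    # Reads kernel log text on stdin; $1 = interface name.
    # Prints the number of "NETDEV WATCHDOG" lines for that interface.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    grep -c "NETDEV WATCHDOG: $1 " || true
}

# Example:
# dmesg | count_tx_timeouts eth0
```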
- same problem with opensuse with Xen 4.1 and 2.6.37-xen dom0 kernel.
- upgrading to bnx2 2.0.23b from Broadcom's site improves matters (at least if I
with this one, not if I unload the old one and load this one) especially if I
the size of the receive ring buffer.
I'm under the impression that those ethernet adapters do things at level 3 and 4
worries me a bit
Glad it's looking better now :) Any further info might be useful.
Very tricky :) All I can think of is running strace and seeing if it's a Xen
Another thought, have you checked for Xen networking bugs ?
That's a tricky problem then !
I'd be most concerned with "The ifconfig and ethtool statistics increase, but a tcpdump shows
nothing (!?)." :-) and probably figure out how to capture whatever is increasing the
Strace on what :-) ?
I might be speaking too soon, but I think I finally found a/the solution. To sum up, it now
1- bnx2 driver updated to latest version from Broadcom (among other things, it allows a
bigger receive ring buffer)
2- increased the rx ring buffer to maximum size (4080)
3- increased the rx-frames and rx-usecs coalescing settings to get fewer interrupts
4- allocated and reserved 2 CPUs to the dom0 (dom0_max_vcpus=2 dom0_vcpus_pin added to Xen
boot args, cpu-pins for guests not to include the first 2 cpus)
5- increased dom0 scheduling weight: xm sched-credit -d Domain-0 -w 512
Of those, only "4" I know is necessary. I've not tried with reverting the other ones, but now
that I've got something running at last, I don't want to break it.
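
For reference, steps 2, 3 and 5 could be scripted roughly as below. This is a sketch, not the exact commands that were run: the run and tune_dom0_net helpers and the DRY_RUN switch are made up, and the rx-frames / rx-usecs values in the example are placeholders since the thread does not give them. Step 4 is a boot-time setting and is not covered here.

```shell
#!/bin/sh
# Dry-run by default: commands are only printed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0 (needs root).
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

tune_dom0_net() {
    # $1 = interface, $2 = rx ring size, $3 = rx-frames, $4 = rx-usecs
    run ethtool -G "$1" rx "$2"                        # step 2: bigger receive ring
    run ethtool -C "$1" rx-frames "$3" rx-usecs "$4"   # step 3: interrupt coalescing
    run xm sched-credit -d Domain-0 -w 512             # step 5: dom0 scheduling weight
}

# Example: 4080 comes from the thread; 64 and 200 are placeholder values.
# tune_dom0_net eth0 4080 64 200
```

Running it once in dry-run mode and reading the printed commands before setting DRY_RUN=0 seems the safer order, since ethtool -G can briefly reset the link on some drivers.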
I now even have a domU with PCI passthrough to one of the ethernet cards/ports and it works
Some questions for food for thought:
When you ping are you pinging the IP directly from the same segment ? Are you sure nothing
like rp_filter is getting in the way ? How are you attempting to start up new TCP connection
You're not using Telnet to SMTP to try and start up new sessions are you ?
How is your SSH session started ?
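
If you want to rule out rp_filter quickly, a sketch like this lists the setting per interface. show_rp_filter and PROC_CONF are made-up names; the procfs path is the standard Linux one.

```shell
#!/bin/sh
# PROC_CONF defaults to the real procfs path and can be overridden for testing.
PROC_CONF="${PROC_CONF:-/proc/sys/net/ipv4/conf}"

show_rp_filter() {
    # Print "<interface> <value>" for every entry.
    # 0 = no source validation, 1 = strict, 2 = loose
    for f in "$PROC_CONF"/*/rp_filter; do
        [ -r "$f" ] || continue
        iface=$(basename "$(dirname "$f")")
        echo "$iface $(cat "$f")"
    done
}

# Example:
# show_rp_filter
```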
A few things spring to mind to hopefully narrow it down a little:
Check MTU across everything. As I'm sure you know tiny data transfers usually work with a
misconfigured MTU but anything else drops out and that's what you usually notice.
Try a USB NIC instead of the one built-in and rule that and the driver out. It'll take 2
minutes and if you don't mind it being 100Mbit instead of 1000Mbit then you can do it at any
Try another switch port or switch ideally and cable.
You said this has happened twice. How far apart were the occurrences ? Think in terms of load
and the time of day ... cron or backups and bandwidth on the switch's backplane etc.
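
For the MTU check mentioned above, a rough sketch follows. It assumes the Linux iputils ping (for -M do) and plain IPv4 without header options; icmp_payload_for_mtu and probe_mtu are made-up helper names, and the address in the example is a placeholder.

```shell
#!/bin/sh
icmp_payload_for_mtu() {
    # $1 = MTU in bytes. Subtract 20 bytes (IPv4 header, no options)
    # and 8 bytes (ICMP header) to get the largest ping payload that fits.
    echo $(($1 - 28))
}

probe_mtu() {
    # $1 = target host, $2 = MTU to test.
    # -M do sets "don't fragment", so an oversized packet fails instead of being split.
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "$2")" "$1"
}

# Examples:
# cat /sys/class/net/eth0/mtu     # MTU configured on the interface
# probe_mtu 192.0.2.10 1500       # 192.0.2.10 is a placeholder address
```

If the probe at the configured MTU fails while a smaller value works, something on the path has a lower MTU than the interface.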