Friday, March 26, 2010

BGP Path MTU Discovery

When a host generates data, the packetization layer (TCP/UDP) chooses the packet size based on the MTU of the outgoing interface. As the packet travels toward its final destination, it may be fragmented if the MTU of the outgoing interface on any router is smaller than the packet size. Fragmentation on an intermediate router is considered inefficient for two reasons:


1. If a single fragment is lost, the source has to resend the entire packet.

2. Fragmentation and reassembly add CPU and buffer load.


Path MTU Discovery (PMTUD) was introduced to reduce the chance of IP packets being fragmented along the path. The source uses it to identify the lowest MTU along the path to the destination and sizes its packets accordingly.


How does PMTUD work?


When the host generates a packet, it sizes it to the MTU of the outgoing interface and sets the DF (Don't Fragment) bit.


Any intermediate device whose outgoing interface MTU is smaller than the packet size does one of two things: 1. Fragments the packet and forwards it, if the DF bit is not set. 2. Drops the packet and sends an ICMP error message with Type=3 (Destination Unreachable), Code=4 (Fragmentation needed and DF bit set), if the DF bit is set.


The ICMP error message also carries the MTU of the outgoing interface in the “Next-Hop MTU” field.
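To make the message format concrete, here is a minimal Python sketch that builds and parses the first 8 bytes of such an ICMP message, using the layout from RFC 1191 (type, code, checksum, an unused 16-bit field, then the 16-bit Next-Hop MTU). The function names are made up for illustration, and a real message would also carry the IP header and first 8 bytes of the dropped datagram after these fields.

```python
import struct

ICMP_DEST_UNREACHABLE = 3   # Type=3
ICMP_FRAG_NEEDED = 4        # Code=4 (fragmentation needed and DF set)

def internet_checksum(data: bytes) -> int:
    """Standard 16-bit one's-complement Internet checksum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_frag_needed(next_hop_mtu: int, original: bytes = b"") -> bytes:
    """Build the ICMP part of a 'fragmentation needed' message (hypothetical helper)."""
    header = struct.pack("!BBHHH", ICMP_DEST_UNREACHABLE, ICMP_FRAG_NEEDED,
                         0, 0, next_hop_mtu)
    csum = internet_checksum(header + original)
    return struct.pack("!BBHHH", ICMP_DEST_UNREACHABLE, ICMP_FRAG_NEEDED,
                       csum, 0, next_hop_mtu) + original

def parse_next_hop_mtu(icmp: bytes):
    """Return the Next-Hop MTU, or None if this is not a Type 3 / Code 4 message."""
    icmp_type, code, _csum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type == ICMP_DEST_UNREACHABLE and code == ICMP_FRAG_NEEDED:
        return mtu
    return None

message = build_frag_needed(300)
print(parse_next_hop_mtu(message))
```

With a Next-Hop MTU of 300, the parser returns 300; for any other ICMP type or code it returns None.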


On receiving the error message, the source resends the packet at the reported MTU. This repeats until the packet reaches the ultimate destination.
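The loop described above can be sketched as a small simulation. This is only an illustrative model, not how any TCP stack implements it: `discover_path_mtu` and its parameters are hypothetical names, and the list of hop MTUs is an assumed example path.

```python
def discover_path_mtu(local_mtu, hop_mtus, icmp_blocked=False, max_probes=16):
    """Simulate PMTUD along a path.

    hop_mtus lists the outgoing-interface MTU of each router on the path.
    Returns (path_mtu, probes_sent); path_mtu is None if discovery fails.
    """
    size = local_mtu  # first packet is sized to the local interface MTU, DF set
    for probe in range(1, max_probes + 1):
        # The first router whose outgoing MTU is too small drops the packet.
        bottleneck = next((mtu for mtu in hop_mtus if mtu < size), None)
        if bottleneck is None:
            return size, probe          # packet reached the destination
        if icmp_blocked:
            return None, probe          # ICMP error never reaches the source
        size = bottleneck               # resend at the reported Next-Hop MTU
    return None, max_probes

# Assumed example path: one hop limits the MTU to 300.
print(discover_path_mtu(1500, [1500, 300, 1500]))
# Same path, but the ICMP error is filtered on the way back.
print(discover_path_mtu(1500, [1500, 300, 1500], icmp_blocked=True))
```

In the first call the source settles on 300 after two packets. In the second call the source never learns the smaller MTU, which is the failure case that matters for BGP.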

BGP support for Path MTU Discovery

Enabling Path MTU Discovery on a BGP session lets the BGP router discover the largest MTU usable along the path to its neighbor, so BGP packets are exchanged more efficiently.

Consider the following scenario:


The initial TCP negotiation between R1 and R5 sets the MSS to the IP MTU minus 40 bytes (20 bytes of IP header plus 20 bytes of TCP header), with the DF bit set. In our case the IP MTU is 1500, which gives an MSS of 1460. Because the initial negotiation packets are very small, they get through, and BGP usually reaches the Established state with the MSS still at that value.

R1#sh ip bgp nei | inc Data

Datagrams (max data segment is 1460 bytes):

After TCP negotiation, the BGP Update packets are sent with the DF bit set, which triggers an ICMP error message from R3 with a Next-Hop MTU of 300. The MSS is then reduced to 260 (300 minus the 40 bytes of IP and TCP headers).

R1#sh ip bgp nei | inc Data

Datagrams (max data segment is 260 bytes)

R1#
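The two MSS values shown in the output follow from simple arithmetic. A minimal sketch, assuming IPv4 and TCP headers without options (20 bytes each); the function name `tcp_mss` is just for illustration:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options

def tcp_mss(ip_mtu: int) -> int:
    """MSS = IP MTU minus the IP and TCP headers."""
    return ip_mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES

print(tcp_mss(1500))  # MSS before PMTUD lowers it
print(tcp_mss(300))   # MSS after R3 reports a Next-Hop MTU of 300
```

These evaluate to 1460 and 260, matching the "max data segment" values above.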

With the same topology, if some intermediate device (a firewall, for example) does not forward ICMP, end-to-end Path MTU Discovery fails. This may cause the BGP session to flap.

We have configured an ACL on R2 to block ICMP messages towards R1, so the ICMP error message from R3 will not reach R1.

As soon as BGP is configured between R1 and R5, TCP negotiation succeeds and BGP moves to the Established state. R1 then sends its BGP Update to R5 with the DF bit set. When a BGP router sends an Update to a neighbor, it does not send a separate keepalive, since the Update itself serves as one. R3, on receiving the oversized Update, drops it and sends an ICMP error message towards R1, which is blocked at R2.

Once the BGP session is up, R5 expects either a BGP Update or a keepalive from R1 to reset its hold timer. After 180 seconds with neither, it sends a BGP Notification to R1 with the error “Hold time expired”.

R1#sh ip bgp nei | inc Data

Datagrams (max data segment is 1460 bytes):

R1#

*Mar 22 15:16:23.033: %BGP-3-NOTIFICATION: received from neighbor 150.1.5.5 4/0 (hold time expired) 0 bytes

R1#

*Mar 22 15:16:23.033: %BGP-5-ADJCHANGE: neighbor 150.1.5.5 Down BGP Notification received

R1#

*Mar 22 15:16:55.621: %BGP-5-ADJCHANGE: neighbor 150.1.5.5 Up

R1#

*Mar 22 15:19:56.409: %BGP-3-NOTIFICATION: received from neighbor 150.1.5.5 4/0 (hold time expired) 0 bytes

R1#

*Mar 22 15:19:56.409: %BGP-5-ADJCHANGE: neighbor 150.1.5.5 Down BGP Notification received

R1#

*Mar 22 15:20:13.361: %BGP-5-ADJCHANGE: neighbor 150.1.5.5 Up
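As a quick cross-check of the log above, the sketch below computes how long the session stayed up between an "Up" and the next "Down" message. The helper names are hypothetical, and the year is a placeholder because the log lines do not carry one.

```python
from datetime import datetime

LOG = """\
*Mar 22 15:16:23.033: %BGP-5-ADJCHANGE: neighbor 150.1.5.5 Down BGP Notification received
*Mar 22 15:16:55.621: %BGP-5-ADJCHANGE: neighbor 150.1.5.5 Up
*Mar 22 15:19:56.409: %BGP-5-ADJCHANGE: neighbor 150.1.5.5 Down BGP Notification received
*Mar 22 15:20:13.361: %BGP-5-ADJCHANGE: neighbor 150.1.5.5 Up
"""

def parse_time(line: str) -> datetime:
    """Extract the timestamp that precedes ': %BGP-...'."""
    stamp = line.lstrip("*").split(": %", 1)[0]   # e.g. "Mar 22 15:16:23.033"
    # Placeholder year: the log itself has none.
    return datetime.strptime("2010 " + stamp, "%Y %b %d %H:%M:%S.%f")

def up_durations(log: str) -> list:
    """Seconds between each 'Up' event and the following 'Down' event."""
    durations = []
    up_since = None
    for line in log.splitlines():
        if "ADJCHANGE" not in line:
            continue
        when = parse_time(line)
        if line.rstrip().endswith(" Up"):
            up_since = when
        elif " Down " in line and up_since is not None:
            durations.append((when - up_since).total_seconds())
            up_since = None
    return durations

print(up_durations(LOG))
```

The single complete Up/Down cycle in the log lasts about 180.8 seconds, which is consistent with the 180-second hold time expiring shortly after the session comes up.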

11 comments:

  1. Nice explanation. Thanks

  2. If ICMP is blocked on the path between two BGP peers, then PMTUD will cease to work. So if we turn off PMTUD, it should work again (albeit not as efficiently if fragmentation occurs). Is this correct?

  3. Yes. When PMTUD is disabled, the MSS defaults to 536 bytes. This works, but it can use the network inefficiently, sending 536-byte segments on links capable of handling around 9000 bytes.

  4. Nagendra,

    When I did a Wireshark capture, I couldn't see the DF bit set on the BGP Update message.
    So do all routing protocols send their update packets with the DF bit set?

    Replies
    1. Hi,

      If you have BGP PMTUD enabled (in certain IOS versions it is disabled by default), BGP packets will be sent with the DF bit set.

  5. awesome explanation...

  6. Hi

    If the BGP Update is large and is dropped by R3, why don't the keepalives reach R5? A keepalive message is 19 bytes, so it should pass through R3. Right?

    Thanks
    Amit

    Replies
    1. Amit,

      As explained in the blog, when a BGP Update is sent by the router, it resets the keepalive timer for that neighbor and does not send any explicit keepalive for that interval. The Update itself is expected to act as a keepalive. So when a BGP Update is sent and dropped by an intermediate node, the neighbor will not receive any Update or keepalive for 180 seconds (or the configured hold time) and will reset the session.

      -Nagendra

  7. I have configured IP MTU 1000 explicitly between R2 & R2 interface, but I am seeing the datagram size as 1460. Shouldn't the datagram size be 960? What could be the issue?


    R1#sh ip bg neighbors | i Data
    Datagrams (max data segment is 1460 bytes):
    !
    R5#sh ip bg neighbors | i Data
    Datagrams (max data segment is 1460 bytes):

  8. It seems you might have changed the IP MTU under the interface after the session was established. Make sure the MTU is configured as 1000 on the egress interface of this BGP session and try clearing the session. That should change the reported segment size.
