HostedDB - Dedicated UNIX Servers

MS Windows 2000 TCP/IP Implementation Details MS Windows 2000 TCP/IP Implementation Details

Operating System

White Paper

By Dave MacDonald and Warren Barkley

Abstract

This white paper describes the Microsoft® Windows® 2000 operating system TCP/IP implementation details, and is a supplement to the Microsoft Windows 2000 TCP/IP manuals. The Microsoft TCP/IP protocol suite is examined from the bottom up. Throughout the paper, network traces are used to illustrate key concepts. These traces were gathered and formatted using Microsoft Network Monitor, a software-based protocol tracing and analysis tool included in the Microsoft Systems Management Server product. The intended audience for this paper is network engineers and support professionals who are already familiar with TCP/IP.

Introduction

Microsoft has adopted TCP/IP as the strategic enterprise network transport for its platforms. In the early 1990s, Microsoft started an ambitious project to create a TCP/IP stack and services that would greatly improve the scalability of Microsoft networking. With the release of the Microsoft® Windows NT® 3.5 operating system, Microsoft introduced a completely rewritten TCP/IP stack. This new stack was designed to incorporate many of the advances in performance and ease of administration that were developed over the past decade. The stack is a high-performance, portable 32-bit implementation of the industry-standard TCP/IP protocol. It has evolved with each version of Windows NT to include new features and services that enhance performance and reliability.

The goals in designing the TCP/IP stack were to make it:

This paper describes Windows 2000 implementation details and is a supplement to the Microsoft Windows 2000 TCP/IP manuals. It examines the Microsoft TCP/IP implementation from the bottom up and is intended for network engineers and support professionals who are familiar with TCP/IP.

This paper uses network traces to help illustrate concepts. These traces were gathered and formatted using Microsoft Network Monitor 2.0, a software-based protocol tracing and analysis tool included in the Microsoft Systems Management Server product. Windows 2000 Server includes a reduced functionality version of Network Monitor. The primary difference between this version and the Systems Management Server version is that the limited version can only capture frames that would normally be seen by the computer that it is installed on, rather than all frames that pass over the network (which requires the adapter to be in promiscuous mode). It also does not support connecting to remote Network Monitor Agents.

Capabilities and Functionality

Overview

The TCP/IP suite for Windows 2000 was designed to make it easy to integrate Microsoft systems into large-scale corporate, government, and public networks, and to provide the ability to operate over those networks in a secure manner. Windows 2000 is an Internet-ready operating system.

Support for Standard Features

Windows 2000 supports the following standard features:

Performance Enhancements

In addition, Windows 2000 has the following performance enhancements:

Services Available

The Windows 2000 Server family of operating systems provides the following services:

Feature Comparison Table for Microsoft TCP/IP Versions

The table below lists features and the operating system versions that they are present in as a reference. Features are described in more detail throughout this document.

Table 1 N=No, Y=Yes, and D=Disabled by Default
Product

Windows 95
Windows 95 Winsock 2
Windows 98
Windows 98 SE
Windows NT 4.0 SP5
Windows 2000
Dead Gateway Detect

N

N

Y

Y

Y

Y

VJ Fast Retransmit

N

Y

Y

Y

Y

Y

AutoNet

N

N

Y

Y

N

Y

SACK (Selective ACK)

N

Y

Y

Y

N

Y

Jumbo frame support

Y

Y

Y

Y

Y

Y

Large Windows

N

D

D

D

N

D

Dynamic DNS

N

N

N

N

N

Y

Media Sense

N

N

N

N

N

Y

Wake-On-LAN

N

N

N

N

N

Y

IP Forwarding

N

N

N

D

D

D

NAT

N

N

N

D

N

D

Kerberos v5

N

N

N

N

N

Y

IPSec (IP Security)

N

N

N

N

N

Y

PPTP

N

N

Y

Y

Y

Y

L2TP

N

N

N

N

N

Y

IP Helper API

N

N

Y

Y

Y

Y

Winsock2 API

N

Y

Y

Y

Y

Y

GQoS API

N

N

Y

Y

N

Y

IP Filtering API

N

N

N

N

N

Y

Firewall Hooks

N

N

N

N

N

Y

Packet Scheduler

N

N

N

N

N

D

RSVP

N

N

Y

Y

N

Y

ISSLO

N

N

Y

Y

N

Y

Trojan Filtering

N

N

N

N

D

D

Blocking src routing

N

N

N

Y

Y

Y

ICMP Router Discovery

N

Y

Y

Y

D

D

Offload-TCP

N

N

N

N

N

Y

Offload-IPSec

N

N

N

N

N

Y

Internet RFCs Supported by Microsoft Windows 2000 TCP/IP

Requests for Comments (RFCs) are a constantly evolving series of reports, proposals for protocols, and protocol standards used by the Internet community. You can use FTP to obtain RFCs from any of the following:

Table 2 RFCs supported by this version of Microsoft TCP/IP
RFC
Title
768

User Datagram Protocol (UDP)

783

Trivial File Transfer Protocol (TFTP)

791

Internet Protocol (IP)

792

Internet Control Message Protocol (ICMP)

793

Transmission Control Protocol (TCP)

816

Fault Isolation and Recovery

826

Address Resolution Protocol (ARP)

854

Telnet Protocol (TELNET)

862

Echo Protocol (ECHO)

863

Discard Protocol (DISCARD)

864

Character Generator Protocol (CHARGEN)

865

Quote of the Day Protocol (QUOTE)

867

Daytime Protocol (DAYTIME)

894

IP over Ethernet

919, 922

IP Broadcast Datagrams (broadcasting with subnets)

950

Internet Standard Subnetting Procedure

959

File Transfer Protocol (FTP)

1001, 1002

NetBIOS Service Protocols

1065, 1035, 1123, 1886

Domain Name System (DNS)

1042

A Standard for the Transmission of IP Datagrams over IEEE 802 Networks

1055

Transmission of IP over Serial Lines (IP-SLIP)

1112

Internet Group Management Protocol (IGMP)

1122, 1123

Host Requirements (communications and applications)

1144

Compressing TCP/IP Headers for Low-Speed Serial Links

1157

Simple Network Management Protocol (SNMP)

1179

Line Printer Daemon Protocol

1188

IP over FDDI

1191

Path MTU Discovery

1201

IP over ARCNET

RFC

Title

1256

ICMP Router Discovery Messages

1323

TCP Extensions for High Performance (see the TCP1323opts registry parameter)

1332

PPP Internet Protocol Control Protocol (IPCP)

1518

Architecture for IP Address Allocation with CIDR

1519

Classless Inter-Domain Routing (CIDR): An Address Assignment and Aggregation Strategy

1534

Interoperation Between DHCP and BOOTP

1542

Clarifications and Extensions for the Bootstrap Protocol

1552

PPP Internetwork Packet Exchange Control Protocol (IPXCP)

1661

The Point-to-Point Protocol (PPP)

1662

PPP in HDLC-like Framing

1748

IEEE 802.5 MIB using SMIv2

1749

IEEE 802.5 Station Source Routing MIB using SMIv2

1812

Requirements for IP Version 4 Routers

1828

IP Authentication using Keyed MD5

1829

ESP DES-CBC Transform

1851

ESP Triple DES-CBC Transform

1852

IP Authentication using Keyed SHA

1886

DNS Extensions to Support IP Version 6

1994

PPP Challenge Handshake Authentication Protocol (CHAP)

1995

Incremental Zone Transfer in DNS

1996

A Mechanism for Prompt DNS Notification of Zone Changes

2018

TCP Selective Acknowledgment Options

2085

HMAC-MD5 IP Authentication with Replay Prevention

2104

HMAC: Keyed Hashing for Message Authentication

2131

Dynamic Host Configuration Protocol

2136

Dynamic Updates in the Domain Name System (DNS UPDATE)

2181

Clarifications to the DNS Specification

2205

Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification

2236

Internet Group Management Protocol, Version 2

2308

Negative Caching of DNS Queries (DNS NCACHE)

2401

Security Architecture for the Internet Protocol

2401

Security Architecture for the Internet Protocol

2402

IP Authentication Header

2406

IP Encapsulating Security Payload (ESP)

2581

TCP Congestion Control

Architectural Model

Overview

The Microsoft TCP/IP suite contains core protocol elements, services, and the interfaces between them. The Transport Driver Interface (TDI) and the Network Device Interface Specification (NDIS) are public, and their specifications are available from Microsoft.1 http://www.microsoft.com and ftp://ftp.microsoft.com).

In addition, there are a number of higher-level interfaces available to user-mode applications. The most commonly-used are Windows Sockets, remote procedure call (RPC), and NetBIOS.

Figure 1 The Windows 2000 TCP/IP network model

Plug and Play

Windows 2000 introduces support for Plug and Play. Plug and Play has the following capabilities and features:

Plug and Play operation does not require Plug and Play hardware. To the degree possible, the first two bullets above apply to legacy hardware, as well as Plug and Play hardware. In some cases, orderly enumeration of legacy devices is not possible because the detection methods are destructive or inordinately time-consuming.

The primary impact that Plug and Play support has on protocol stacks is that network interfaces can come and go at any time. The Windows 2000 TCP/IP stack and related components have been adapted to support Plug and Play.

The NDIS Interface and Below

Microsoft networking protocols use the Network Device Interface Specification (NDIS) to communicate with network card drivers. Much of the OSI model link layer functionality is implemented in the protocol stack. This makes development of network card drivers much simpler.

Network Driver Interface Specification (3.1 through 5.0)

NDIS 3.1 supports basic services that allow a protocol module to send raw packets over a network device and allow that same module to be notified of incoming packets received by a network device.

NDIS 4.0 added the following new features to NDIS 3.1:

NDIS 5.0 includes all functionality defined in NDIS 4.0, plus the following extensions:

NDIS can power down network adapters when the system requests a power level change. Either the user or the system can initiate this request. For example, the user may want to put the computer in sleep mode, or the system may request a power level change based on keyboard or mouse inactivity. In addition, disconnecting the network cable can initiate a power-down request if the network interface card (NIC) supports this functionality. In this case, the system waits a configurable time period before powering down the NIC because the disconnect could be the result of temporary wiring changes on the network, rather than the disconnection of a cable from the network device itself.

NDIS power management policy is no network activity–based. This means that all overlying network components must agree to the request before the NIC can be powered down. If there are any active sessions or open files over the network, the power-down request can be refused by any or all of the components involved.

The computer can also be awakened from a lower power state, based on network events. A wakeup signal can be caused by:

At driver initialization, NDIS queries the capabilities of the miniport to determine if it supports such things as Magic Packet, pattern match, or link change wakeups, and to determine the lowest required power state for each wakeup method. The network protocols then query the miniport capabilities. At run time, the protocol sets the wakeup policy, using object identifiers (OIDs), such as Enable Wakeup, Set Packet Pattern, and Remove Packet Pattern.

Currently, Microsoft TCP/IP is the only Microsoft protocol stack that supports network power management. It registers the following packet patterns at miniport initialization:

NDIS-compliant drivers are available for a wide variety of NICs from many vendors. The NDIS interface allows multiple protocol drivers of different types to bind to a single NIC driver and allows a single protocol to bind to multiple NIC drivers. The NDIS specification describes the multiplexing mechanism used to accomplish this. Bindings can be viewed or changed from the Windows Network Connections folder.

Windows 2000 TCP/IP provides support for:

The goals for these new features include the following:

Link Layer Functionality

Link layer functionality is divided between the network interface card/driver combination and the low-level protocol stack driver. The network card/driver combination filters are based on the destination media access control (MAC) address of each frame.

Normally, the hardware filters out all incoming frames except those containing one of the following destination addresses:

Because this first filtering decision is made by the hardware, the NIC discards any frames that do not meet the filter criteria without incurring any CPU processing. All frames (including broadcasts) that pass the hardware filter are then passed up to the NIC driver through a hardware interrupt.2 The NIC driver is software that runs on the computer, so any frames that make it this far require some CPU time to process. The NIC driver brings the frame into system memory from the interface card. Then the frame is indicated (passed up) to the appropriate bound transport driver(s). The NDIS 5.0 specification provides more detail on this process.

Frames are passed up to all bound transport drivers in the order that they are bound.

As a packet traverses a network or series of networks, the source media access control address is always that of the NIC that placed it on the media, and the destination media access control address is that of the NIC that is intended to pull it off the media. This means that, in a routed network, the source and destination media access control address changes with each hop through a network-layer device (router or Layer 3 switch).

Maximum Transmission Unit (MTU)

Each media type has a maximum frame size that cannot be exceeded. The link layer is responsible for discovering this MTU and reporting it to the protocols above. NDIS drivers may be queried for the local MTU by the protocol stack. Knowledge of the MTU for an interface is used by upper layer protocols, such as TCP, that optimize packet sizes for each media automatically. For details, see the discussion of TCP Path Maximum Transmission Unit (PMTU) discovery in the "Transmission Control Protocol (TCP)" section of this paper.

If a NIC driver—such as an ATM driver—uses LAN emulation mode, it may report that it has an MTU that is higher than what is expected for that media type. For example, it may emulate Ethernet but report an MTU of 9180 bytes. Windows NT and Windows 2000 accept and use the MTU size reported by the adapter, even when it exceeds the normal MTU for a given media type.

Sometimes the MTU reported to the protocol stack may be less than what would be expected for a given media type. For instance, use of the 802.1p standard for QoS over Ethernet often (this is hardware dependent) reduces the MTU reported by 4 bytes due to larger link-layer headers.

Core Protocol Stack Components and the TDI Interface

The core protocol stack components are those shown between the NDIS and TDI interfaces in figure 1. They are implemented in the Windows 2000 Tcpip.sys driver. The Microsoft stack is accessible through the TDI interface and the NDIS interface. The Winsock2 interface also provides some support for direct access to the protocol stack.

Address Resolution Protocol (ARP)

ARP performs IP address-to-Media Access Control (MAC) address resolution for outgoing packets. As each outgoing IP datagram is encapsulated in a frame, source and destination media access control addresses must be added. Determining the destination media access control address for each frame is the responsibility of ARP.

ARP compares the destination IP address on every outbound IP datagram to the ARP cache for the NIC over which the frame will be sent. If there is a matching entry, the MAC address is retrieved from the cache. If not, ARP broadcasts an ARP Request Packet on the local subnet, requesting that the owner of the IP address in question reply with its media access control address. If the packet is going through a router, ARP resolves the media access control address for that next-hop router, rather than the final destination host. When an ARP reply is received, the ARP cache is updated with the new information, and it is used to address the packet at the link layer.

ARP Cache

You can use the ARP utility to view, add, or delete entries in the ARP cache. Examples are shown below. Entries added manually are static and are not automatically removed from the cache, whereas dynamic entries are removed from the cache (see the "ARP Cache Aging" section for more information).

The arp command can be used to view the ARP cache, as shown here:

C:\>arp –a
Interface: 199.199.40.123
Internet Address Physical Address Type
199.199.40.1 00-00-0c-1a-eb-c5 dynamic
199.199.40.124 00-dd-01-07-57-15 dynamic
Interface: 10.57.8.190
Internet Address Physical Address Type
10.57.9.138 00-20-af-1d-2b-91 dynamic

The computer in this example is multihomed—has more than one NIC—so there is a separate ARP cache for each interface.

In the following example, the command arp –s is used to add a static entry to the ARP cache used by the second interface for the host whose IP address is 10.57.10.32 and whose NIC address is 00608C0E6C6A:

C:\>arp -s 10.57.10.32 00-60-8c-0e-6c-6a 10.57.8.190

C:\>arp -a

Interface: 199.199.40.123
Internet Address Physical Address Type
199.199.40.1 00-00-0c-1a-eb-c5 dynamic
199.199.40.124 00-dd-01-07-57-15 dynamic

Interface: 10.57.8.190
Internet Address Physical Address Type
10.57.9.138 00-20-af-1d-2b-91 dynamic
10.57.10.32 00-60-8c-0e-6c-6a static

ARP Cache Aging

Windows NT and Windows 2000 adjust the size of the ARP cache automatically to meet the needs of the system. If an entry is not used by any outgoing datagram for two minutes, the entry is removed from the ARP cache. Entries that are being referenced are removed from the ARP cache after ten minutes. Entries added manually are not removed from the cache automatically. A new registry parameter, ArpCacheLife, was added in Windows NT 3.51 Service Pack 4 to allow more administrative control over aging. This parameter is described in Appendix A.

Use the command arp –d to delete entries from the cache, as shown below:

C:\>arp -d 10.57.10.32

C:\>arp -a

Interface: 199.199.40.123
Internet Address Physical Address Type
199.199.40.1 00-00-0c-1a-eb-c5 dynamic
199.199.40.124 00-dd-01-07-57-15 dynamic

Interface: 10.57.8.190
Internet Address Physical Address Type
10.57.9.138 00-20-af-1d-2b-91 dynamic

ARP queues only one outbound IP datagram for a specified destination address while that IP address is being resolved to a media access control address. If a User Datagram Protocol (UDP)-based application sends multiple IP datagrams to a single destination address without any pauses between them, some of the datagrams may be dropped if there is no ARP cache entry already present. An application can compensate for this by calling the iphlpapi.dll routine SendArp() to establish an ARP cache entry, before sending the stream of packets. See the Microsoft Knowledge Base article Q193059 or the Platform SDK for IP Helper API details.

Internet Protocol (IP)

IP is the mailroom of the TCP/IP stack, where packet sorting and delivery take place. At this layer, each incoming or outgoing packet is referred to as a datagram. Each IP datagram bears the source IP address of the sender and the destination IP address of the intended recipient. Unlike the media access control addresses, the IP addresses in a datagram remain the same throughout a packet's journey across an internetwork. IP layer functions are described below.

Routing

Routing is a primary function of IP. Datagrams are handed to IP from UDP and TCP above, and from the NIC(s) below. Each datagram is labeled with a source and destination IP address. IP examines the destination address on each datagram, compares it to a locally maintained route table, and decides what action to take. There are three possibilities for each datagram:

The route table maintains four different types of routes. They are listed below in the order that they are searched for a match:

  1. Host (a route to a single, specific destination IP address)
  2. Subnet (a route to a subnet)
  3. Network (a route to an entire network)
  4. Default (used when there is no other match)

To determine a single route to use to forward an IP datagram, IP uses the following process:

  1. For each route in the routing table, IP performs a bit-wise logical AND between the destination IP address and the netmask. IP compares the result with the network destination for a match. If they match, IP marks the route as one that matches the destination IP address.
  2. From the list of matching routes, IP determines the route that has the most bits in the netmask. This is the route that matches the most bits to the destination IP address and is therefore the most specific route for the IP datagram. This is known as finding the longest or closest matching route.
  3. If multiple closest matching routes are found, IP uses the route with the lowest metric. If multiple closest matching routes with the lowest metric are found, IP can choose to use any of those routes.

You can use the route print command to view the route table from the command prompt, as shown below:

C:\>route print
===============================================================
Interface List
0x1 ........................... MS TCP Loopback interface
0x2 ...00 a0 24 e9 cf 45 ...... 3Com 3C90x Ethernet Adapter
0x3 ...00 53 45 00 00 00 ...... NDISWAN Miniport
0x4 ...00 53 45 00 00 00 ...... NDISWAN Miniport
0x5 ...00 53 45 00 00 00 ...... NDISWAN Miniport
0x6 ...00 53 45 00 00 00 ...... NDISWAN Miniport
==================================================================
==================================================================
Active Routes:
Network Destination Netmask Gateway Interface Metric
0.0.0.0 0.0.0.0 10.99.99.254 10.99.99.1 1
10.99.99.0 255.255.255.0 10.99.99.1 10.99.99.1 1
10.99.99.1 255.255.255.255 127.0.0.1 127.0.0.1 1
10.255.255.255 255.255.255.255 10.99.99.1 10.99.99.1 1
127.0.0.0 255.0.0.0 127.0.0.1 127.0.0.1 1
224.0.0.0 224.0.0.0 10.99.99.1 10.99.99.1 1
255.255.255.255 255.255.255.255 10.99.99.1 10.99.99.1 1
Default Gateway: 10.99.99.254
=================================================================
Persistent Routes:
None

The route table above is for a computer with the class A IP address of 10.99.99.1, the subnet mask of 255.255.255.0, and the default gateway of 10.99.99.254. It contains the following eight entries:

The Default Gateway is the currently active default gateway. This is useful to know when multiple default gateways are configured.

On this host, if a packet is sent to 10.99.99.40, the closest matching route is the local subnet route (10.99.99.0 with the mask of 255.255.255.0). The packet is sent via the local interface 10.99.99.1. If a packet is sent to 10.200.1.1, the closest matching route is the default route. In this case, the packet is forwarded to the default gateway.

The route table is maintained automatically in most cases. When a host initializes, entries for the local network(s), loopback, multicast, and configured default gateway are added. More routes may appear in the table as the IP layer learns of them. For instance, the default gateway for a host may advise it of a better route to a specific network, subnet, or host, using ICMP, which is explained later in this white paper. Routes also may be added manually using the route command, or by a routing protocol. The -p (persistent) switch can be used with the route command to specify permanent routes. Persistent routes are stored in the registry under the registry key

HKEY_LOCAL_MACHINE
\SYSTEM
\CurrentControlSet
\Services
\Tcpip
\Parameters
\PersistentRoutes

Windows 2000 TCP/IP introduces a new metric configuration option for default gateways. This metric allows better control of which default gateway is active at any particular time. The default value for the metric is 1. A route with a lower metric value is preferred to a route with a higher metric. In the case of default gateways, the computer will use the one with the lowest metric unless it appears to be inactive, in which case dead gateway detection may trigger a switch to the next lowest metric default gateway in the list. Default gateway metrics can be set using TCP/IP Advanced Configuration properties. DHCP servers provide a base metric, and a list of default gateways. If a DHCP server provides a base of 100, and a list of three default gateways, the gateways will be configured with metrics of 100, 101, and 102 respectively. A DHCP-provided base does not apply to statically configured default gateways.

Most Autonomous System (AS) routers use a protocol such as Routing Information Protocol (RIP) or Open Shortest Path First (OSPF) to exchange routing tables with each other. Windows 2000 Server includes support for these protocols. Windows 2000 Professional includes support for silent RIP.

By default, Windows-based systems do not behave as routers and do not forward IP datagrams between interfaces. However, the Routing and Remote Access service is included in Windows 2000 Server. It can be enabled and configured to provide full multiprotocol routing services.

To administer the Routing and Remote Access

  1. On the Start menu, point to Programs.
  2. Point to Administrative Tools, and then click Routing and Remote Access.

When running multiple logical subnets on the same physical network, the following command can be used to tell IP to treat all subnets as local and to use ARP directly for the destination:

route add 0.0.0.0 MASK 0.0.0.0 <my local ip address>

Thus, packets destined for non-local subnets are transmitted directly onto the local media instead of being sent to a router. In essence, the local interface card can be designated as the default gateway. This can be useful where several class C networks are used on one physical network with no router to the outside world, or in a proxy-ARP environment.

Duplicate IP Address Detection

Duplicate address detection is an important feature. When the stack is first initialized or when a new IP address is added, gratuitous ARP requests are broadcast for the IP addresses of the local host. The number of ARPs to send is controlled by the ArpRetryCount registry parameter, which defaults to 3. If another host replies to any of these ARPs, the IP address is already in use. When this happens, the Windows-based computer still boots; however, the interface containing the offending address is disabled, a system log entry is generated, and an error message is displayed. If the host that is defending the address is also a Windows-based computer, a system log entry is generated, and an error message is displayed on that computer. In order to repair the damage possibly done to the ARP caches on other computers, the offending computer re-broadcasts another ARP, restoring the original values in the ARP caches of the other computers.

A computer using a duplicate IP address can be started when it is not attached to the network, in which case no conflict would be detected. However, if it is then plugged into the network, the first time that it sends an ARP request for another IP address, any Windows NT–based computer with a conflicting address detects the conflict. The computer detecting the conflict displays an error message and logs a detailed event in the system log. A sample event log entry is shown below:

The system detected an address conflict for IP address 199.199.40.123 with the system having network hardware address 00:DD:01:0F:7A:B5. Network operations on this system may be disrupted as a result.

DHCP-enabled clients inform the DHCP server when an IP address conflict is detected and, instead of invalidating the stack, they request a new address from the DHCP server and request that the server flag the conflicting address as bad. This capability is commonly known as DHCP Decline support.

Multihoming

When a computer is configured with more than one IP address, it is referred to as a multihomed system. Multihoming is supported in three different ways:

When an IP datagram is sent from a multihomed host, it is passed to the interface with the best apparent route to the destination. Accordingly, the datagram may contain the source IP address of one interface in the multihomed host, yet be placed on the media by a different interface. The source media access control address on the frame is that of the interface that actually transmitted the frame to the media, and the source IP address is the one that the sending application sourced it from, not necessarily one of the IP addresses associated with the sending interface in the Network Connections UI.

When a computer is multihomed with NICs attached to disjoint networks (networks that are separate from and unaware of each other, such as a remote access-connected network and a local connection), routing problems may arise. It is often necessary to set up static routes to remote networks in this situation.

When configuring a computer to be multihomed on two disjoint networks, the best practice is to set the default gateway on the main or largest and least-known network. Then, either add static routes or use a routing protocol to provide connectivity to the hosts on the smaller or better-known network. Avoid configuring a different default gateway on each side; this can result in unpredictable behavior and loss of connectivity.

Note There can only be one active default gateway for a computer at any moment in time.

More details on name registration, resolution, and choice of NIC on outbound datagrams with multihomed computers are provided in the "Transmission Control Protocol (TCP)," "NetBIOS over TCP/IP," and "Windows Sockets" sections of this paper.

Classless Interdomain Routing (CIDR)

CIDR, described in RFCs 1518 and 1519, removes the concept of class from the IP address assignment and management process. In place of predefined, well-known boundaries, CIDR allocates addresses defined by a starting address and a range, which makes more efficient use of available space. The range defines the network part of the address. For example an assignment from an ISP to a corporate client might be expressed as 10.57.1.128 /25. This would result in a 128-address block for local use, with the upper 25 bits being the network identifier part of the address. A legacy, class-full allocation would be expressed as <net>.0.0.0 /8, <net>.<net>.0.0 /16, or <net>.<net>.<net>.0 /24. As these are reclaimed, they will be reallocated using classless CIDR techniques.

Given the installed base of class-full systems, the initial implementation of CIDR was to concatenate pieces of the Class C space. This process was called supernetting. Supernetting can be used to consolidate several class C network addresses into one logical network. To use supernetting, the IP network addresses that are to be combined must share the same high-order bits, and the subnet mask is shortened to take bits away from the network portion of the address and add them to the host portion. For example, the class C network addresses 199.199.4.0, 199.199.5.0, 199.199.6.0, and 199.199.7.0 can be combined by using a subnet mask of 255.255.252.0 for each:

NET 199.199.4 (1100 0111.1100 0111.0000 0100.0000 0000)
NET 199.199.5 (1100 0111.1100 0111.0000 0101.0000 0000)
NET 199.199.6 (1100 0111.1100 0111.0000 0110.0000 0000)
NET 199.199.7 (1100 0111.1100 0111.0000 0111.0000 0000)
MASK 255.255.252.0 (1111 1111.1111 1111.1111 1100.0000 0000)

When routing decisions are made, only the bits covered by the subnet mask are used, thus making all these addresses appear to be part of the same network for routing purposes. Any routers in use must also support CIDR and may require special configuration. Windows 2000 TCP/IP includes support for 0's and 1's subnets as described in RFC 1878.

IP Multicasting

IP multicasting is used to provide efficient multicast services to clients that may not be located on the same network segment. Windows Sockets applications can join a multicast group to participate in a wide-area conference, for instance.

Windows 2000 is level-2 (send and receive) compliant with RFC 1112. IGMP is the protocol used to manage IP multicasting, which is described later in this document.

IP over ATM

Windows 2000 introduces support for IP over ATM. RFC 1577 (and successors) define the basic operation of an IP over ATM network, or more precisely, a Logical IP Subnet over an ATM network. A Logical IP Subnet (or LIS) is a set of IP hosts that can communicate directly with each other. Two hosts belonging to different Logical IP Subnets can communicate only through an IP router that is a member of both subnets.

ATM Address Resolution

Because an ATM network is non-broadcast, ARP broadcasts (as used by Ethernet or Token Ring) are not a suitable solution. Instead, a dedicated Address Resolution Protocol server (or ARP server) is used to provide IP-to-ATM address resolution.

One of the stations in a LIS is designated as an ARP server (and the ARP server software is loaded on it). Stations that use the services of the ARP server are referred to as ARP clients. All IP stations within a LIS are ARP clients. Each ARP client is configured with the ATM address of the ARP server. When an ARP client starts up, it makes an ATM connection to the ARP server, and sends a packet to the server that contains the client's IP and ATM addresses. The ARP server builds a table of IP-address-to-ATM-address mappings. When a client has an IP packet to be sent to another client (whose IP address is known but whose ATM address is unknown), it first queries the ARP server for the ATM address of the desired client. When it receives a reply that contains the desired ATM address, the client establishes a direct ATM connection to the target client and sends IP packets for that client on this connection.

The clients close any ATM connection, including the connection to the server, if the connections are inactive. All clients refresh their IP and ATM address information with the server periodically (the default is 15 minutes). An entry that is not refreshed after 20 minutes (by default) is purged by the server. The ATM ARP client and ARP server both support a number of adjustable registry parameters, which are listed in Appendix A.

Internet Control Message Protocol (ICMP)

ICMP is a maintenance protocol specified in RFC 792 and is normally considered part of the IP layer. ICMP messages are encapsulated within IP datagrams, so that they can be routed throughout an internetwork. Windows NT and Windows 2000 use ICMP to:

ICMP Router Discovery

Windows 2000 can perform router discovery as specified in RFC 1256. Router discovery provides an improved method of configuring and detecting default gateways. Instead of using manually- or DHCP-configured default gateways, hosts can dynamically discover routers on their subnet. If the primary router fails or the network administrators change router preferences, hosts can automatically switch to a backup router.

When a host that supports router discovery initializes, it joins the all-systems IP multicast group (224.0.0.1), and then listens for the router advertisements that routers send to that group. Hosts can also send router-solicitation messages to the all-routers IP multicast address (224.0.0.2) when an interface initializes to avoid any delay in being configured. Windows 2000 sends a maximum of three solicitations at intervals of approximately 600 milliseconds.

The use of router discovery is controlled by the PerformRouterDiscovery and SolicitationAddressBCast registry parameters, and it defaults to DHCP controlled in Windows 2000.

Setting SolicitationAddressBCast to 1 causes router solicitations to be broadcast, instead of multicast, as described in the RFC.

Maintaining Route Tables

When a Windows-based computer is initialized, the route table normally contains only a few entries. One of those entries specifies a default gateway. Datagrams that have a destination IP address with no better match in the route table are sent to the default gateway. However, because routers share information about network topology, the default gateway may know a better route to a given address. When this is the case, then upon receiving a datagram that could take the better path, the router forwards the datagram normally. It then advises the sender of the better route, using an ICMP Redirect message. These messages can specify redirection for one host, a subnet, or for an entire network. When a Windows-based computer receives an ICMP redirect, a validity check is performed to be sure that it came from the first-hop gateway in the current route, and that the gateway is on a directly connected network. If so, a host route with a 10-minute lifetime is added to the route table for that destination IP address. If the ICMP redirect did not come from the first-hop gateway in the current route, or if that gateway is not on a directly connected network, the ICMP redirect is ignored.

Path Maximum Transmission Unit (PMTU) Discovery

TCP employs Path Maximum Transmission Unit (PMTU) discovery, as described later in the "Transmission Control Protocol (TCP)" section of this paper. The mechanism relies on ICMP Destination Unreachable messages.

Use of ICMP to Diagnose Problems

Quality of Service (QoS) and Resource Reservation Protocol (RSVP)

Another new feature in Windows 2000 is support for QoS. Windows 2000 supports several QoS mechanisms such as the Resource reServation Protocol (RSVP), Differentiated Services (DiffServ), IEEE 802.1p, ATM QoS, and so on. The QoS mechanisms supported in Windows 2000 are abstracted through a simple Generic QoS (GQoS) API. An overview of support for QoS from the stack and related system components is presented here.

The GQoS API is an extension to the Winsock programming interface. It includes APIs and system components that provide applications with a method of reserving network bandwidth between client and server. Windows 2000 automatically maps GQoS requests to QoS mechanisms such as RSVP, Diffserv, 802.1p or ATM QoS. RSVP is a layer 3 signaling protocol that is used to reserve bandwidth for individual flows on a network. RSVP is a per-flow QoS mechanism because it sets up a reservation for each flow. Diffserv is another layer 3 QoS mechanism. Diffserv defines 6 bits in the IP header that determine how the IP packet is prioritized3 . Diffserv traffic can be prioritized into 64 possible classes known as Per Hop Behaviors (PHBs). 802.1p, on the other hand, is a layer 2 QoS mechanism that defines how layer 2 devices such as Ethernet switches should prioritize traffic. 802.1p defines 8 priority classes ranging from 0 to 7. DiffServ and 802.1p are called aggregate QoS mechanisms because they classify all traffic into a finite number of priority classes.

The following sequence of events characterize an application's interaction with GQoS:

  1. The application requests QoS in abstract terms via GQoS.
  2. The application's request translates into RSVP signaling messages. RSVP signaling messages go out onto the network and reserve bandwidth on all RSVP-aware nodes in the network path.
  3. In addition to setting up reservations, RSVP messages are subject to scrutiny by policy servers on the network. Policy servers can reject the RSVP request if it is in violation of network policy. This gives the network administrator a means of enforcing who gets QoS.
  4. Once the RSVP reservation has been installed, Windows 2000 starts marking all outgoing packets for that flow with the appropriate DiffServ class and 802.1p priority.
  5. As the traffic from the flow makes its way through the network, it gets the benefit of 802.1p prioritization in 802.1p-enabled Ethernet switches, the benefit of RSVP reservations in RSVP-enabled routers, and the benefits of DiffServ prioritization in DiffServ-enabled clouds in the network.

There are several other QoS mechanisms—such as Integrated Services over ATM (ISATM), which automatically maps GQoS requests to ATM QoS on Classical IP over ATM networks. Integrated Services Over Low Bit Rate (ISSLOW) is another QoS mechanism that improves latency for prioritized traffic on slow WAN links. In addition to the GQoS API, a control or management application has access to traffic control functionality via the Traffic Control (TC) API. The TC API allows a control or management application to assist in providing some quality of service for non-QoS-enabled applications. Windows 2000 also provides a policy server called the QoS Admission Control Service (QoS ACS). The QoS ACS allows network administrators to control who gets QoS on the network. The QoS ACS also exposes an API called the Local Policy Module (LPM) API. The LPM API allows ISVs to build customized policy modules that add to the policy enforcement functionality in the QoS ACS.

Figure 2, below, illustrates the system components involved in QoS and RSVP. GQoS is a QoS provider that can invoke RSVP signaling, trigger traffic control, and provide notification of events to the application. Rsvp.exe is responsible for RSVP signaling to or from the network, and for invoking Traffic.dll to add flows and filters to the stack. The packet classifier is responsible for classifying packets according to the packet filters indicated by Traffic.dll. The packet scheduler maintains separate queues for each classification of traffic and includes a conformance analyzer, shaper, and packet sequencer. The shaper manages flows into the packet queues at the agreed-upon rate, and the sequencer feeds packets to the network interface in the order of priority from the queues that it manages. Traffic that has no QoS specification goes into the best effort queue, which is lowest in priority.

Figure 2 QoS/RSVP architecture

The flowchart in figure 2 illustrates how an application uses QoS RSVP to deliver a flow of data to a client or clients. The application is an audio server, and it needs 1 megabit-per-second of reliable bandwidth to provide acceptable audio quality to a client. RSVP supports both unicast and multicast flows. This example uses a unicast flow to a single client.

The application initializes and completes a structure to be provided to GQoS. This structure includes a sending and receiving flow specification. Flow specifications include parameters such as peak bandwidth, latency, delay variation, service type, and so on. Examples of service types include Best Effort and Guaranteed.

The application then calls WSAConnect to connect to the client. A call to this function triggers a number of events. RSVP is invoked to signal the network by sending special path messages. A path message is sent to the same destination IP address that the flow goes to; however, it is intended to set up the routers in the flow and to identify the flow. A router receiving a path message inserts its own IP address into the path message's last hop and forwards the message to the next router in the path until it reaches the client. This gives the client the ability to understand the path between the sender and itself and to reserve bandwidth along that path for the application. The client returns a reservation request (again describing the desired flow) back along the same path. The routers along the path are responsible for examining the resources available to them and determining if they can accept the reservation. If all of the routers along the path agree to accept the reservation, the application can count on having the desired network bandwidth and other characteristics available.

Because networks are dynamic and the server or client could mistakenly abandon their resources without notifying the network, both path messages and reservation requests must be refreshed frequently. If there were no changes in the network, additional path messages and reservations refresh only the existing path. However, if a new route appears, the path taken by the flow could change on the fly as the network makes adjustments.

When a server application is used to multicast to many clients, a similar sequence of events occurs. One interesting difference is that when routers receive reservation requests from various clients referencing the same flow, they can merge reservation requests, rather than maintaining individual reservations for the same information flow.

For more, detailed information on these topics, see the Winsock2 specification and RFC 2205.

IP Security (IPSec)

IP Security (IPSec) is another new feature in Windows 2000. IPSec features and implementation details are very complex and are described in detail in a series of RFCs and IETF drafts and in other Microsoft white papers. IPSec uses cryptography-based security to provide access control, connectionless integrity, data origin authentication, protection against replays, confidentiality, and limited traffic-flow confidentiality. Because IPSec is provided at the IP layer, its services are available to the upper-layer protocols in the stack and, transparently, to existing applications.

IPSec enables a system to select security protocols, decide which algorithm(s) to use for the service(s), and establish and maintain cryptographic keys for each security relationship. IPSec can protect paths between hosts, between security gateways, or between hosts and security gateways. The services available and required for traffic are configured using IPSec policy. IPSec policy may be configured locally on a computer or can be assigned through Windows 2000 Group Policy mechanisms using the Active Directory™ services. When using the Active Directory, hosts detect policy assignment at startup, retrieve the policy, and then periodically check for policy updates. The IPSec policy specifies how computers trust each other. IPSec can use either certificates or Kerberos as an authentication method. The easiest trust to use is the Windows 2000 domain trust based on Kerberos. Predefined IPSec policies are configured to trust computers in the same or other trusted Windows 2000 domains.

Each IP datagram processed at the IP layer is compared to a set of filters that are provided by the security policy, which is maintained by an administrator for a computer that belongs to a domain. IP can do one of three things with any datagram:

An IPSec policy contains a filter, filter action, authentication, tunnel setting, and connection type. For example, two stand-alone computers in the same Windows 2000 domain can be configured to use IPSec between them and activate the secure server policy. If the two computers are not members of the same or a trusted domain, trust must be configured using a certificate or preshared key in a secure server mode by:

Once the policy has been put in place, traffic that matches the filters uses the services provided by IPSec. When IP traffic (including something as simple as a ping in this case) is directed at one host by another, a Security Association (SA) is established through a short conversation over UDP port 500, through Internet Key Exchange service (IKE), and then the traffic begins to flow. The following network trace illustrates setting up a TCP connection between two such IPSec-enabled hosts. The only parts of the IP datagram that are unencrypted and visible to Netmon after the SA is established are the media access control and IP headers:

Source IP Dest IP Prot Description
davemac-ipsec calvin-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 216 (0xD8)
calvin-ipsec davemac-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 216 (0xD8)
davemac-ipsec calvin-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 128 (0x80)
calvin-ipsec davemac-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 128 (0x80)
davemac-ipsec calvin-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 76 (0x4C)
calvin-ipsec davemac-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 76 (0x4C)
davemac-ipsec calvin-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 212 (0xD4)
calvin-ipsec davemac-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 172 (0xAC)
davemac-ipsec calvin-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 84 (0x54)
calvin-ipsec davemac-ipsec UDP Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 92 (0x5C)
davemac-ipsec calvin-ipsec IP ID = 0xC906; Proto = 0x32; Len: 96
calvin-ipsec davemac-ipsec IP ID = 0xA202; Proto = 0x32; Len: 96
davemac-ipsec calvin-ipsec IP ID = 0xCA06; Proto = 0x32; Len: 88

Opening one of the IP datagrams sent after the SA is established reveals very little of what is actually in the datagram (a TCP SYN, or connection request). The only clear parts of the packet are the Ethernet and IP headers. Even the TCP header is encrypted and cannot be parsed by Netmon if ESP is used.

Src IP Dest IP Protoc Description
===================================================
davemac-ipsec calvin-ipsec IP ID = 0xC906; Proto = 0x32; Len: 96

+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
IP: ID = 0xC906; Proto = 0x32; Len: 96
IP: Version = 4 (0x4)
IP: Header Length = 20 (0x14)
IP: Precedence = Routine
IP: Type of Service = Normal Service
IP: Total Length = 96 (0x60)
IP: Identification = 51462 (0xC906)
+ IP: Flags Summary = 2 (0x2)
IP: Fragment Offset = 0 (0x0) bytes
IP: Time to Live = 128 (0x80)
IP: Protocol = 0x32
IP: Checksum = 0xD55A
IP: Source Address = 172.30.250.139
IP: Destination Address = 157.59.24.37
IP: Data: Number of data bytes remaining = 76 (0x004C)

00000: 52 A4 68 7B 94 80 00 00 90 1D 84 80 08 00 45 00 R.h{..........E.
00010: 00 60 C9 06 40 00 80 32 D5 5A AC 1E FA 8B 9D 3B .`..@..2.Z.....;
00020: 18 25 18 D9 03 E8 00 00 00 01 F6 EF D0 23 1C 59 .%...........#.Y
00030: BD 01 78 BE 69 24 D6 EB AE 4F 08 DA 0F D4 6C 04 ..x.i$...O....l.
00040: 5F BC A6 E0 8D BE 5C 89 2D 56 60 80 FA 8B CC 5E _.....\.-V`....^
00050: 4E 61 3D 46 75 B9 D1 5B 52 45 79 7D 1E 36 1F 01 Na=Fu..[REy}.6..
00060: FF 25 E5 BA 48 AF D7 7A D5 9A 34 3E 5D 7D .%..H..z..4>]}

Using a secure server policy also restricts all other types of traffic from reaching destinations that do not understand IPSec or are not part of the same trusted group. Secure Initiator policy provides settings that apply best to servers; traffic security is attempted, but if the client does not understand IPSec, the negotiation falls back to sending clear text packets.

When IPSec is used to encrypt data, network performance generally drops, due to the processing overhead of encryption. One possible method for reducing the impact of this overhead is to offload the processing to a hardware device. Because NDIS 5.0 supports task offloading, it is feasible to include encryption hardware on NICs. NICs supporting IPSec hardware offload are available from several vendors.

IPSec promises to be popular for protecting both public network traffic and internal corporate/government traffic that requires confidentiality. One common implementation may be to apply secure server IPSec policies only to specific servers that are used to store and/or serve confidential information.

Internet Group Management Protocol (IGMP)

Windows 2000 provides level 2 (full) support for IP multicasting (IGMP version 2), as described in RFC 1112 and RFC 2236. The introduction to RFC 1112 provides a good overall summary of IP multicasting. The text reads:

"IP multicasting is the transmission of an IP datagram to a host group—a set of zero or more hosts identified by a single IP destination address. A multicast datagram is delivered to all members of its destination host group with the same 'best-effort' reliability as regular unicast IP datagrams; that is, the datagram is not guaranteed to arrive intact to all members of the destination group or in the same order relative to other datagrams.

"The membership of a host group is dynamic; that is, hosts may join and leave groups at any time. There is no restriction on the location or number of members in a host group. A host may be a member of more than one group at a time. A host need not be a member of a group to send datagrams to it.

"A host group may be permanent or transient. A permanent group has a well-known, administratively assigned IP address. It is the address—not the membership of the group—that is permanent; at any time a permanent group may have any number of members, even zero. Those IP multicast addresses that are not reserved for permanent groups are available for dynamic assignment to transient groups that exist only as long as they have members.

"Internetwork forwarding of IP multicast datagrams is handled by multicast routers that may be co-resident with, or separate from, Internet gateways. A host transmits an IP multicast datagram as a local network multicast that reaches all immediately-neighboring members of the destination host group. If the datagram has an IP time-to-live greater than 1, the multicast router(s) attached to the local network take responsibility for forwarding it towards all other networks that have members of the destination group. On those other member networks that are reachable within the IP time-to-live, an attached multicast router completes delivery by transmitting the datagram as a local multicast."

IP/ARP Extensions for IP Multicasting

To support IP multicasting, an additional route is defined on the host. The route (added by default) specifies that if a datagram is being sent to a multicast host group, it should be sent to the IP address of the host group through the local interface card, and not forwarded to the default gateway. The following route (which you can discover using the route print command) illustrates this:

Network Address Netmask Gateway Address Interface Metric
224.0.0.0 224.0.0.0 10.99.99.1 10.99.99.1 1

Host group addresses are easily identified, as they are from the class D range, 224.0.0.0 to 239.255.255.255. These IP addresses all have 1110 as their high-order bits.

To send a packet to a host group, using the local interface, the IP address must be resolved to a media access control address. As stated in the RFCs:

"An IP host group address is mapped to an Ethernet multicast address by placing the low-order 23 bits of the IP address into the low-order 23 bits of the Ethernet multicast address 01-00-5E-00-00-00 (hex). Because there are 28 significant bits in an IP host group address, more than one host group address may map to the same Ethernet multicast address."

For example, a datagram addressed to the multicast address 225.0.0.5 would be sent to the (Ethernet) media access control address 01-00-5E-00-00-05. This media access control address is formed by the junction of 01-00-5E and the 23 low-order bits of 225.0.0.5 (00-00-05).

Because more than one host group address can map to the same Ethernet multicast address, the interface may indicate hand-up multicasts for a host group for which no local applications have a registered interest. These extra multicasts are discarded by TCP/IP.

Multicast Extensions to Windows Sockets

Internet Protocol multicasting is currently supported only on AF_INET sockets of type SOCK_DGRAM and SOCK_RAW. By default, IP multicast datagrams are sent with a Time to Live (TTL) of 1. Applications can use the setsockopt function to specify a TTL. By convention, multicast routers use TTL thresholds to determine how far to forward datagrams. These TTL thresholds are defined as follows:

Use of IGMP by Windows Components

Some Windows NT and Windows 2000 components use IGMP. For example, router discovery uses multicasts, by default. WINS servers use multicasting when attempting to locate replication partners.

Transmission Control Protocol (TCP)

TCP provides a connection-based, reliable byte-stream service to applications. Microsoft networking relies upon the TCP transport for logon, file and print sharing, replication of information between domain controllers, transfer of browse lists, and other common functions. It can only be used for one-to-one communications.

TCP uses a checksum on both the headers and payload of each segment to reduce the chance that network corruption will go undetected. NDIS 5.0 provides support for task offloading, and Windows 2000 TCP takes advantage of this by allowing the NIC to perform the TCP checksum calculations if the NIC driver offers support for this function. Offloading the checksum calculations to hardware can result in performance improvements in very high-throughput environments. Windows 2000 TCP has also been hardened against a variety of attacks that were published over the past couple of years and has been subject to an internal security review intended to reduce susceptibility to future attacks. For instance, the initial sequence number algorithm has been modified so that ISNs increase in random increments, using an RC4-based random number generator initialized with a 2048-bit random key upon system startup.

TCP Receive Window Size Calculation and Window Scaling (RFC 1323)

The TCP receive window size is the amount of receive data (in bytes) that can be buffered at one time on a connection. The sending host can send only that amount of data before waiting for an acknowledgment and window update from the receiving host. The Windows 2000 TCP/IP stack was designed to tune itself in most environments and uses larger default window sizes than earlier versions. Instead of using a hard-coded default receive window size, TCP adjusts to even increments of the maximum segment size (MSS) negotiated during connection setup. Matching the receive window to even increments of the MSS increases the percentage of full-sized TCP segments used during bulk data transmission.

The receive window size defaults to a value calculated as follows:

  1. The first connection request sent to a remote host advertises a receive window size of 16 KB (16,384 bytes).
  2. Upon establishing the connection, the receive window size is rounded up to an increment of the maximum TCP segment size (MSS) that was negotiated during connection setup.
  3. If that is not at least four times the MSS, it is adjusted to 4 * MSS, with a maximum size of 64 KB unless a window scaling option (RFC 1323) is in effect.

For Ethernet, the window is normally set to 17,520 bytes (16 KB rounded up to twelve 1460-byte segments.) There are two methods for setting the receive window size to specific values:

To improve performance on high-bandwidth, high-delay networks, scalable windows support (RFC 1323) has been introduced in Windows 2000. This RFC details a method for supporting scalable windows by allowing TCP to negotiate a scaling factor for the window size at connection establishment. This allows for an actual receive window of up to 1 gigabyte (GB). RFC 1323 Section 2.2 provides a good description:

"The three-byte Window Scale option may be sent in a SYN segment by a TCP. It has two purposes: 1. indicate that the TCP is prepared to do both send and receive window scaling, and 2. communicate a scale factor to be applied to its receive window. Thus, a TCP that is prepared to scale windows should send the option, even if its own scale factor is 1. The scale factor is limited to a power of two and encoded logarithmically, so it may be implemented by binary shift operations.

TCP Window Scale Option (WSopt):
Kind: 3 Length: 3 bytes
+---------+---------+---------+
| Kind=3 |Length=3 |shift.cnt|
+---------+---------+---------+

"This option is an offer, not a promise; both sides must send Window Scale options in their SYN segments to enable window scaling in either direction. If window scaling is enabled, then the TCP that sent this option will right-shift its true receive-window values by 'shift.cnt' bits for transmission in SEG.WND. The value shift.cnt may be zero (offering to scale, while applying a scale factor of 1 to the receive window).

"This option may be sent in an initial <SYN> segment (in other words, a segment with the SYN bit on and the ACK bit off). It may also be sent in a <SYN,ACK> segment, but only if a Window Scale option was received in the initial <SYN> segment. A Window Scale option in a segment without a SYN bit should be ignored.

"The Window field in a SYN (in other words, a <SYN> or <SYN,ACK>) segment itself is never scaled."

When you read network traces of a connection that was established by two computers that support scalable windows, keep in mind that the window sizes advertised in the trace must be scaled by the negotiated scale factor. The scale factor can be observed in the connection establishment (three-way handshake) packets, as illustrated in the following Network Monitor capture:

*************************************************************************

Src Addr Dst Addr Protocol Description
THEMACS1 NTBUILDS TCP ....S., len:0, seq:725163-725163, ack:0,
win:65535, src:1217 dst:139

+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
+ IP: ID = 0xB908; Proto = TCP; Len: 64
TCP: ....S., len:0, seq:725163-725163, ack:0, win:65535, src:1217
dst:139 (NBT Session)
TCP: Source Port = 0x04C1
TCP: Destination Port = NETBIOS Session Service
TCP: Sequence Number = 725163 (0xB10AB)
TCP: Acknowledgement Number = 0 (0x0)
TCP: Data Offset = 44 (0x2C)
TCP: Reserved = 0 (0x0000)
+ TCP: Flags = 0x02 : ....S.
TCP: Window = 65535 (0xFFFF)
TCP: Checksum = 0x8565
TCP: Urgent Pointer = 0 (0x0)
TCP: Options
+ TCP: Maximum Segment Size Option
TCP: Option Nop = 1 (0x1)
TCP: Window Scale Option
TCP: Option Type = Window Scale
TCP: Option Length = 3 (0x3)
TCP: Window Scale = 5 (0x5)
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
+ TCP: Timestamps Option
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
+ TCP: SACK Permitted Option

00000: 8C 04 C8 BD A3 82 00 00 50 7D 83 80 08 00 45 00
........P}....E.
00010: 00 40 B9 08 40 00 80 06 A7 1A 9D 36 15 FD AC 1F
.@..@......6....
00020: 3B 42 04 C1 00 8B 00 0B 10 AB 00 00 00 00 B0 02
;B..............
00030: FF FF 85 65 00 00 02 04 05 B4 01 03 03 05 01 01
...e............
00040: 08 0A 00 00 00 00 00 00 00 00 01 01 04 02 ..............
*************************************************************************

The computer sending the packet above is offering the Window Scale option, with a scaling factor of 5. If the target computer responds, accepting the Window Scale option in the SYN-ACK, then it is understood that any TCP window advertised by this computer needs to be left-shifted 5 bits from this point onward (the SYN itself is not scaled). For example, if the computer advertised a 32 KB window in its first send of data, this value would need to be left-shifted (shifting in 0's from the right) 5 bits as shown below:

32Kbytes = 0x7fff = 111 1111 1111 1111
Left-shift 5 bits = 1111 1111 1111 1110 0000 = 0xffffe (1,048,544 bytes)

As a check, left-shifting a number 5 bits is equivalent to multiplying it
by 25, or 32. 32767 * 32 = 1,048,544

The scale factor is not necessarily symmetrical, so it may be different for each direction of data flow.

Windows 2000 uses window scaling automatically if the TcpWindowSize is set to a value greater than 64 KB, and the Tcp1323Opts registry parameter is set appropriately. See Appendix A for details on setting this parameter.

Delayed Acknowledgments

As specified in RFC 1122, TCP uses delayed acknowledgments (ACKs) to reduce the number of packets sent on the media. The Microsoft TCP/IP stack takes a common approach to implementing delayed ACKs. As data is received by TCP on a connection, it only sends an acknowledgment back if one of the following conditions is met:

In summary, normally an ACK is sent for every other TCP segment received on a connection, unless the delayed ACK timer (200 milliseconds) expires. The delayed ACK timer can be adjusted through the TcpDelAckTicks registry parameter, which is new in Windows 2000.

TCP Selective Acknowledgment (RFC 2018)

Windows 2000 introduces support for an important performance feature known as Selective Acknowledgement (SACK). SACK is especially important for connections using large TCP window sizes. Prior to SACK, a receiver could only acknowledge the latest sequence number of contiguous data that had been received, or the left edge of the receive window. When SACK is enabled, the receiver continues to use the ACK number to acknowledge the left edge of the receive window, but it can also acknowledge other non-contiguous blocks of received data individually. SACK uses TCP header options, as shown below. This text was taken directly from RFC 2018:

"Sack-Permitted Option

"This two-byte option may be sent in a SYN by a TCP that has been extended to receive (and presumably process) the SACK option once the connection has opened. It MUST NOT be sent on non-SYN segments.

TCP Sack-Permitted Option:
Kind: 4
+---------+---------+
| Kind=4 | Length=2|
+---------+---------+

"Sack Option Format

"The SACK option is to be used to convey extended acknowledgment information from the receiver to the sender over an established TCP connection.

TCP SACK Option:
Kind: 5
Length: Variable
+--------+--------+
| Kind=5 | Length |
+--------+--------+--------+--------+
| Left Edge of 1st Block |
+--------+--------+--------+--------+
| Right Edge of 1st Block |
+--------+--------+--------+--------+
| |
/ . . . /
| |
+--------+--------+--------+--------+
| Left Edge of nth Block |
+--------+--------+--------+--------+
| Right Edge of nth Block |
+--------+--------+--------+--------+

When SACK is enabled (the default), a packet or series of packets can be dropped, and the receiver can inform the sender of exactly which data has been received, and where the holes in the data are. The sender can then selectively retransmit the missing data without needing to retransmit blocks of data that have already been received successfully. SACK is controlled by the SackOpts registry parameter. The Network Monitor capture below illustrates a host acknowledging all data up to sequence number 54857341, plus the data from sequence number 54858789-54861685.

+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
+ IP: ID = 0x1A0D; Proto = TCP; Len: 64
TCP: .A...., len:0, seq:925104-925104, ack:54857341, win:32722, src:1242 dst:139
TCP: Source Port = 0x04DA
TCP: Destination Port = NETBIOS Session Service
TCP: Sequence Number = 925104 (0xE1DB0)
TCP: Acknowledgement Number = 54857341 (0x3450E7D)
TCP: Data Offset = 44 (0x2C)
TCP: Reserved = 0 (0x0000)
+ TCP: Flags = 0x10 : .A....
TCP: Window = 32722 (0x7FD2)
TCP: Checksum = 0x4A72
TCP: Urgent Pointer = 0 (0x0)
TCP: Options
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
+ TCP: Timestamps Option
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
TCP: SACK Option
TCP: Option Type = 0x05
TCP: Option Length = 10 (0xA)
TCP: Left Edge of Block = 54858789 (0x3451425)
TCP: Right Edge of Block = 54861685 (0x3451F75)

TCP Timestamps (RFC 1323)

Another RFC 1323 feature introduced in Windows 2000 is support for TCP time stamps. Like SACK, time stamps are important for connections using large window sizes. Time stamps were conceived to assist TCP in accurately measuring round-trip time (RTT) to adjust retransmission time-outs. The TCP header option for time stamps is shown here, from RFC 1323:

"TCP Timestamps Option (TSopt):

Kind: 8

Length: 10 bytes

+-------+-------+---------------------+---------------------+
|Kind=8 | 10 | TS Value (TSval) |TS Echo Reply (TSecr)|
+-------+-------+---------------------+---------------------+
1 1 4 4

"The Timestamps option carries two four-byte time stamp fields. The time-stamp value field (TSval) contains the current value of the time-stamp clock of the TCP sending the option.

"The Timestamp Echo Reply field (TSecr) is only valid if the ACK bit is set in the TCP header; if it is valid, it echoes a timestamp value that was sent by the remote TCP in the TSval field of a Timestamps option. When TSecr is not valid, its value must be zero. The TSecr value will generally be from the most recent Timestamp option that was received; however, there are exceptions that are explained below.

"A TCP may send the Timestamps option (TSopt) in an initial <SYN> segment (i.e., segment containing a SYN bit and no ACK bit), and may send a TSopt in other segments only if it received a TSopt in the initial <SYN> segment for the connection."

The Timestamps option field can be viewed in a Network Monitor capture by expanding the TCP options field, as shown below:

TCP: Timestamps Option
TCP: Option Type = Timestamps
TCP: Option Length = 10 (0xA)
TCP: Timestamp = 2525186 (0x268802)
TCP: Reply Timestamp = 1823192 (0x1BD1D8)

The use of time stamps is disabled by default. It can be enabled using theTcp1323Opts registry parameter, explained in Appendix A.

Path Maximum Transmission Unit (PMTU) Discovery

PMTU discovery is described in RFC 1191. When a connection is established, the two hosts involved exchange their TCP maximum segment size (MSS) values. The smaller of the two MSS values is used for the connection. Historically, the MSS for a host has been the MTU at the link layer minus 40 bytes for the IP and TCP headers. However, support for additional TCP options, such as time stamps, has increased the typical TCP+IP header to 52 or more bytes.

Figure 3 MTU versus MSS

When TCP segments are destined to a non-local network, the Don't Fragment bit is set in the IP header. Any router or media along the path can have an MTU that differs from that of the two hosts. If a media segment has an MTU that is too small for the IP datagram being routed, the router attempts to fragment the datagram accordingly. It then finds that the Don't Fragment bit is set in the IP header. At this point, the router should inform the sending host that the datagram can not be forwarded further without fragmentation. This is done with an ICMP Destination Unreachable Fragmentation Needed and DF Set message. Most routers also specify the MTU for the next hop by putting the value for it in the low-order 16 bits of the ICMP header field that is unused in RFC 792. See RFC 1191, section 4, for the format of this message. Upon receiving this ICMP error message, TCP adjusts its MSS for the connection to the specified MTU minus the TCP and IP header size so that any further packets sent on the connection are no larger than the maximum size that can traverse the path without fragmentation.

Note The minimum MTU permitted is 88 bytes, and Windows 2000 TCP enforces this limit.

Some noncompliant routers may silently drop IP datagrams that can not be fragmented or may not correctly report their next-hop MTU. If this occurs, it may be necessary to make a configuration change to the PMTU detection algorithm. There are two registry changes that can be made to the TCP/IP stack in Windows 2000 to work around these problematic devices. These registry entries are described in more detail in Appendix A:

The PMTU between two hosts can be discovered manually using the ping command with the -f (don't fragment) switch, as follows:

ping -f -n <number of pings> -l <size> <destination ip address>

As shown in the example below, the size parameter can be varied until the MTU is found. The size parameter used by ping is the size of the data buffer to send, not including headers. The ICMP header consumes 8 bytes, and the IP header is normally 20 bytes. In the case below (Ethernet), the link layer MTU is the maximum-sized ping buffer plus 28, or 1500 bytes:

C:\>ping -f -n 1 -l 1472 10.99.99.10

Pinging 10.99.99.10 with 1472 bytes of data:

Reply from 10.99.99.10: bytes=1472 time<10ms TTL=128

Ping statistics for 10.99.99.10:

Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),

Approximate round trip times in milli-seconds:

Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\>ping -f -n 1 -l 1473 10.99.99.10

Pinging 10.99.99.10 with 1473 bytes of data:

Packet needs to be fragmented but DF set.

Ping statistics for 10.99.99.10:

Packets: Sent = 1, Received = 0, Lost = 1 (100% loss),

Approximate round trip times in milliseconds:

Minimum = 0ms, Maximum = 0ms, Average = 0ms

In the example shown above, the IP layer returned an ICMP error message that ping interpreted. If the router had been a black hole router, ping would simply not be answered once its size exceeded the MTU that the router could handle. Ping can be used in this manner to detect such a router.

A sample ICMP Destination unreachable error message is shown here:

******************************************************************************
Src Addr Dst Addr Protocol Description
10.99.99.10 10.99.99.9 ICMP Destination Unreachable: 10.99.99.10
See frame 3

+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
+ IP: ID = 0x4401; Proto = ICMP; Len: 56
ICMP: Destination Unreachable: 10.99.99.10 See frame 3
ICMP: Packet Type = Destination Unreachable
ICMP: Unreachable Code = Fragmentation Needed, DF Flag Set
ICMP: Checksum = 0xA05B
ICMP: Next Hop MTU = 576 (0x240)
ICMP: Data: Number of data bytes remaining = 28 (0x001C)
ICMP: Description of original IP frame
ICMP: (IP) Version = 4 (0x4)
ICMP: (IP) Header Length = 20 (0x14)
ICMP: (IP) Service Type = 0 (0x0)
ICMP: Precedence = Routine
ICMP: ...0.... = Normal Delay
ICMP: ....0... = Normal Throughput
ICMP: .....0.. = Normal Reliability
ICMP: (IP) Total Length = 1028 (0x404)
ICMP: (IP) Identification = 45825 (0xB301)
ICMP: Flags Summary = 2 (0x2)
ICMP: .......0 = Last fragment in datagram
ICMP: ......1. = Cannot fragment datagram
ICMP: (IP) Fragment Offset = 0 (0x0) bytes
ICMP: (IP) Time to Live = 32 (0x20)
ICMP: (IP) Protocol = ICMP - Internet Control Message
ICMP: (IP) Checksum = 0xC91E
ICMP: (IP) Source Address = 10.99.99.9
ICMP: (IP) Destination Address = 10.99.99.10
ICMP: (IP) Data: Number of data bytes remaining = 8 (0x0008)
ICMP: Description of original ICMP frame
ICMP: Checksum = 0xBC5F
ICMP: Identifier = 256 (0x100)
ICMP: Sequence Number = 38144 (0x9500)

00000: 00 AA 00 4B B1 47 00 AA 00 3E 52 EF 08 00 45 00 ...K.G...>R...E.
00010: 00 38 44 01 00 00 80 01 1B EB 0A 63 63 0A 0A 63 .8D........cc..c
00020: 63 09 03 04 A0 5B 00 00 02 40 45 00 04 04 B3 01 c....[...@E.....
00030: 40 00 20 01 C9 1E 0A 63 63 09 0A 63 63 0A 08 00 @. ....cc..cc...
00040: BC 5F 01 00 95 00

This error was generated by using ping -f –n 1 -l 1000 on an Ethernet-based host to send a large datagram across a router interface that only supports an MTU of 576 bytes. When the router tried to place the large frame onto the network with the smaller MTU, it found that fragmentation was not allowed. Therefore, it returned the error message indicating that the largest datagram that could be forwarded is 0x240, or 576 bytes.

Dead Gateway Detection

Dead gateway detection is used to allow TCP to detect failure of the default gateway and to adjust the IP routing table to use another default gateway. The Microsoft TCP/IP stack uses the triggered reselection method described in RFC 816, with slight modifications based upon customer experiences and feedback.

When a TCP connection routed through the default gateway attempts to send a TCP packet to the destination a number of times (equal to one-half of the registry value TcpMaxDataRetransmissions) without receiving a response, the algorithm changes the Route Cache Entry (RCE) for that remote IP address to use the next default gateway in the list. When 25 percent of the TCP connections have moved to the next default gateway, the algorithm advises IP to change the computer's default gateway to the one that the connections are now using.

For example, assume that there are currently TCP connections to 11 different IP addresses that are being routed through the default gateway. Now assume that the default gateway fails, that there is a second default gateway configured, and that the value for TcpMaxDataRetransmissions is at the default of 5.

When the first TCP connection tries to send data, it does not receive any acknowledgments. After the third retransmission, the RCE for that remote IP address is switched to the next default gateway in the list. At this point, any TCP connections to that one remote IP address have switched over, but the remaining connections still try to use the original default gateway.

When the second TCP connection tries to send data, the same thing happens. Now, two of the 11 RCEs point to the new gateway.

When the third TCP connection tries to send data, after the third retransmission, three of 11 RCEs have been switched to the second default gateway. Because, at this point, over 25 percent of the RCEs have been moved, the default gateway for the whole computer is moved to the new one.

That default gateway remains the primary one for the computer until it experiences problems (causing the dead gateway algorithm to try the next one in the list again) or until the computer is restarted.

When the search reaches the last default gateway, it returns to the beginning of the list.

TCP Retransmission Behavior

TCP starts a retransmission timer when each outbound segment is handed down to IP. If no acknowledgment has been received for the data in a given segment before the timer expires, the segment is retransmitted. For new connection requests, the retransmission timer is initialized to 3 seconds (controllable using the TcpInitialRtt per-adapter registry parameter), and the request (SYN) is resent up to the value specified in TcpMaxConnectRetransmissions (the default for Windows 2000 is 2 times). On existing connections, the number of retransmissions is controlled by the TcpMaxDataRetransmissions registry parameter (5 by default). The retransmission time-out is adjusted on the fly to match the characteristics of the connection, using Smoothed Round Tr