
INFINIBAND
CS 708 Seminar

NEETHU RANJIT (Roll No. 05088) B. Tech. Computer Science & Engineering

College of Engineering Kottarakkara Kollam 691 531 Ph: +91.474.2453300 http://www.cek.ihrd.ac.in

Certificate

This is to certify that this report titled InfiniBand is a bona fide record of the CS 708 Seminar work done by Miss. NEETHU RANJIT, Reg. No. 10264042, Seventh Semester B. Tech. Computer Science & Engineering student, under our guidance and supervision, in partial fulfillment of the requirements for the award of the degree B. Tech. Computer Science and Engineering of Cochin University of Science & Technology.

October 16, 2008

Guide

Coordinator & Dept. Head

Mr Renjith S.R Lecturer Dept. of Computer Science & Engg.

Mr Ahammed Siraj K K Asst. Professor Dept. of Computer Science & Engg.

Acknowledgments
I express my wholehearted thanks to our respected Principal Dr Jacob Thomas and to Mr Ahammed Siraj sir, Head of the Department, for providing me with the guidance and facilities for the seminar. I wish to express my sincere thanks to Mr Renjith sir, Lecturer in the Computer Science Department and my guide, for his timely advice during the course of my seminar. I thank all faculty members of College of Engineering Kottarakkara for their cooperation in completing my seminar. My sincere thanks to all those well-wishers and friends who have helped me during the course of the seminar work and have made it a great success. Above all, I thank the Almighty Lord, the foundation of all wisdom, for guiding me step by step throughout my seminar. Last but not least, I would like to thank my parents for their moral support.

NEETHU RANJIT

Abstract

InfiniBand is a powerful new architecture designed to support I/O connectivity for the Internet infrastructure. It is supported by all major OEM server vendors as a means to expand and create the next-generation I/O interconnect standard in servers. For the first time, a high-volume, industry-standard I/O interconnect extends the role of traditional in-the-box buses beyond the physical connector. InfiniBand is unique in providing connectivity in a way previously reserved only for traditional networking; this unification of I/O and system-area networking requires a new architectural domain. Underlying this major transition is InfiniBand's superior ability to support the Internet's requirements for RAS: Reliability, Availability, and Serviceability.

The InfiniBand Architecture (IBA) is an industry-standard architecture for server I/O and interprocessor communication. IBA enables Quality of Service (QoS) through a set of mechanisms: service levels, virtual lanes, and table-based arbitration of virtual lanes. A formal model manages the IBA arbitration tables to provide QoS; according to this model, each application needs a sequence of entries in the arbitration tables based on its requirements, which concern the mean bandwidth needed and the maximum latency tolerated by the application. InfiniBand provides a comprehensive silicon, software, and system solution; this report gives an overview of its layered protocol and management infrastructure. The comprehensive nature of the architecture is visible in the major sections of the InfiniBand I/O specification, which range from an industry-standard electrical interface and mechanical connectors to well-defined software and management services. InfiniBridge, a channel adapter and switch implementation of the architecture, supports InfiniBand's packet-switching features.


Contents
1 INTRODUCTION
2 INFINIBAND ARCHITECTURE
3 COMPONENTS OF INFINIBAND
   3.1 HCA and TCA Channel Adapters
   3.2 Switches
   3.3 Routers
4 INFINIBAND BASIC FABRIC TOPOLOGY
5 IBA Subnet
   5.1 Links
   5.2 Endnodes
6 FLOW CONTROL
7 INFINIBAND SUBNET MANAGEMENT AND QoS
8 REMOTE DIRECT MEMORY ACCESS (RDMA)
   8.1 Comparing a Traditional Server I/O and RDMA-Enabled I/O
9 INFINIBAND PROTOCOL STACK
   9.1 Physical Layer
   9.2 Link Layer
   9.3 Network Layer
   9.4 Transport Layer
10 COMMUNICATION SERVICES
   10.1 Communication Stack: InfiniBand Support for the Virtual Interface Architecture (VIA)
11 INFINIBAND FABRIC VERSUS SHARED BUS
12 INFINIBRIDGE
   12.1 Hardware transport performance of InfiniBridge
13 INFINIBRIDGE CHANNEL ADAPTER ARCHITECTURE
14 VIRTUAL OUTPUT QUEUEING ARCHITECTURE
15 FORMAL MODEL TO MANAGE INFINIBAND ARBITRATION TABLES TO PROVIDE QUALITY OF SERVICE (QoS)
   15.1 THREE MECHANISMS TO PROVIDE QoS
      15.1.1 Service Level
      15.1.2 Virtual Lanes
      15.1.3 Virtual Arbitration Table
16 FORMAL MODEL FOR THE INFINIBAND ARBITRATION TABLE
   16.0.4 Initial Hypothesis
17 FILLING IN THE VL ARBITRATION TABLE
   17.1 Insertion and elimination in the table
      17.1.1 Example 1
   17.2 Defragmentation Algorithm
   17.3 Reordering Algorithm
   17.4 Global management of the table
18 CONCLUSION

1 INTRODUCTION

Bus architectures have a tremendous amount of inertia because they dictate the bus interface architecture of semiconductor devices. For this reason, successful bus architectures typically enjoy a dominant position for ten years or more. The PCI bus was introduced to the standard PC architecture in the early 1990s and has maintained its dominance with only one major upgrade during that period: from 32-bit/33 MHz to 64-bit/66 MHz. The PCI-X initiative takes this one step further to 133 MHz and seemingly should give the PCI architecture a few more years of life. But there is a divergence between what personal computers and servers require.

Throughout the past decade of fast-paced computer development, the traditional Peripheral Component Interconnect (PCI) architecture has remained the dominant input/output standard for most internal backplane and external peripheral connections. These days, however, the PCI bus, with its shared-bus approach, is noticeably lagging: performance limitations, poor bandwidth, and reliability issues are surfacing within the higher market tiers, and the PCI bus is quickly becoming an outdated technology. Computers are made up of a number of addressable elements (CPU, memory, screen, hard disks, LAN and SAN interfaces, and so on) that use a system bus for communication. As these elements have become faster, the system bus and the overhead associated with data movement between devices, commonly referred to as I/O, have become a gating factor in computer performance. To address the problem of server performance with respect to I/O in particular, InfiniBand was developed as a standards-based protocol that offloads data movement from the CPU to dedicated hardware, thus allowing more CPU cycles to be dedicated to application processing. As a result, InfiniBand, by leveraging networking technologies and principles, provides scalable, high-bandwidth transport for efficient communication between InfiniBand-attached devices.

InfiniBand technology advances I/O connectivity for data center and enterprise infrastructure deployment, overcoming the I/O bottleneck in today's server architectures. Although primarily suited for next-generation server I/O, InfiniBand can also extend to the embedded computing, storage, and telecommunications industries. This high-volume, industry-standard I/O interconnect extends the role of traditional backplane and board buses beyond the physical connector.

Another major bottleneck is the scalability problem of parallel-bus architectures such as PCI. As these buses scale in speed, they cannot support the multiple network interfaces that system designers require; for example, the PCI-X bus at 133 MHz can support only one slot, and at higher speeds these buses begin to look like point-to-point connections. Mellanox Technologies' InfiniBand silicon product, InfiniBridge, lets system designers construct entire fabrics based on the device's switching and channel adapter functionality. InfiniBridge implements an advanced set of packet switching, quality of service, and flow control mechanisms. These capabilities support multiprotocol environments with many I/O devices shared by multiple servers. InfiniBridge features include an integrated switch and PCI channel adapter, InfiniBand 1X and 4X link speeds (defined as 2.5 and 10 Gbps), eight virtual lanes, and a maximum transfer unit (MTU) size of up to 2 Kbytes. InfiniBridge also offers multicast support, an embedded subnet management agent, and InfiniPCI for transparent PCI-to-PCI bridging.

InfiniBand is an architecture and specification for data flow between processors and I/O devices that promises greater bandwidth and almost unlimited expandability, and it is intended to replace the existing Peripheral Component Interconnect (PCI) shared bus. Offering link throughput of 2.5 gigabits per second and support for up to 64,000 addressable devices, the architecture also promises increased reliability, better sharing of data between clustered processors, and built-in security. The InfiniBand architecture specification was released by the InfiniBand Trade Association, and InfiniBand is backed by top companies in the industry, including Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun. Underlying this major I/O transition is InfiniBand's ability to provide Quality of Service; several mechanisms exist to provide it, one of which is the formal model for managing the arbitration tables.

2 INFINIBAND ARCHITECTURE

InfiniBand is a switched, point-to-point interconnect for data centers based on a 2.5-Gbps link speed, scaling up to 30 Gbps. The architecture defines a layered hardware protocol (physical, link, network, and transport layers) and a software layer to support fabric management and low-latency communication between devices. InfiniBand provides transport services for the upper-layer protocols and supports flow control and Quality of Service to provide ordered, guaranteed packet delivery across the fabric. An InfiniBand fabric may comprise a number of InfiniBand subnets that are interconnected using InfiniBand routers, where each subnet may consist of one or more InfiniBand switches and InfiniBand-attached devices.

The InfiniBand standard defines Reliability, Availability, and Serviceability from the ground up, making the specification efficient to implement in silicon yet able to support a broad range of applications. InfiniBand's physical layer supports a wide range of media by using a differential serial interconnect with an embedded clock. This signaling supports printed circuit board, backplane, copper, and fiber links, and it leaves room for further growth in speed and media types. The physical layer implements 1X, 4X, and 12X links by byte striping over multiple links. InfiniBand's layered protocol and its other features are described in the following sections.

An InfiniBand system area network has four basic system components that interconnect using InfiniBand links, as Fig 1 shows. The host channel adapter (HCA) terminates a connection for a host node; it includes hardware features to support high-performance memory transfers into CPU memory. The target channel adapter (TCA) terminates a connection for a peripheral node; it defines a subset of HCA functionality and can be optimized for embedded applications. The switch handles link-layer packet forwarding; a switch does not consume or generate packets other than management packets. The router sends packets between subnets using the network layer; InfiniBand routers divide InfiniBand networks into subnets and likewise do not consume or generate packets other than management packets. Finally, a subnet manager runs on each subnet and handles device and connection management tasks. A subnet manager can run on a host or be embedded in switches and routers, and all system components must include a subnet management agent that handles communication with the subnet manager.


Figure 1: INFINIBAND ARCHITECTURE

3 COMPONENTS OF INFINIBAND

The main components in the InfiniBand architecture are:

3.1 HCA and TCA Channel Adapters

HCAs are present in servers or even desktop machines and provide the interface used to integrate InfiniBand with the operating system. TCAs are present on I/O devices such as a RAID subsystem or a JBOD subsystem. Host and target channel adapters present an interface to the layers above them that allows those layers to generate and consume packets. In the case of a server writing a file to a storage device, the host generates the packets that are then consumed by the storage device. Each channel adapter has one or more ports, and a channel adapter with more than one port may be connected to multiple switch ports.

3.2 Switches

Switches simply forward packets between two of their ports based on the established routing table and the addressing information stored in the packets. A collection of end nodes connected to one another through one or more switches forms a subnet. Each subnet must have at least one subnet manager that is responsible for the configuration and management of the subnet.


Figure 2: InfiniBand Switch

3.3 Routers

Routers are like switches in that they simply forward packets between their ports. The difference is that a router is used to interconnect two or more subnets to form a multi-domain system area network. Within a subnet, each port is assigned a unique identifier by the subnet manager called the local identifier (LID). In addition to the LID, each port is assigned a globally unique identifier called the GID. A main feature of the InfiniBand architecture that is not available in the current shared-bus I/O architecture is the ability to partition the ports within the fabric that can communicate with one another. This is useful for partitioning the available storage across one or more servers for management reasons.


Figure 3: System Network of Infiniband

4 INFINIBAND BASIC FABRIC TOPOLOGY
InfiniBand is a high-speed, serial, channel-based, switch-fabric, message-passing architecture in which servers, Fibre Channel devices, SCSI RAID arrays, routers, and other end nodes can each have their own dedicated fat pipe. Each node can talk to any other node in a many-to-many configuration. Redundant paths can be set up through an InfiniBand fabric for fault tolerance, and InfiniBand routers can connect multiple subnets. The figure below shows the simplest configuration of an InfiniBand installation, where two or more nodes are connected to one another through the fabric. A node represents either a host device, such as a server, or an I/O device, such as a RAID subsystem. The fabric itself may consist of a single switch in the simplest case, or a collection of interconnected switches and routers. Each connection between nodes, switches, and routers is a point-to-point, serial connection.


Figure 4: InfiniBand Fabric Topology


Figure 5: IBA SUBNET

5 IBA Subnet

The smallest complete IBA unit is a subnet, illustrated in the figure. Multiple subnets can be joined by routers (not shown) to create large IBA networks. The elements of a subnet, as shown in the figure, are endnodes, switches, links, and a subnet manager. Endnodes, such as hosts and devices, send messages over links to other endnodes; the messages are routed by switches. Routing is defined, and subnet discovery performed, by the subnet manager. Channel adapters (CAs), not shown, connect endnodes to links.

5.1 Links

IBA links are bidirectional point-to-point communication channels and may be either copper or optical fibre. The signalling rate on all links is 2.5 Gbaud in the 1.0 release; later releases will undoubtedly be faster. Automatic training sequences are defined in the architecture that will allow compatibility with later, faster speeds. Physical links may be used in parallel to achieve greater bandwidth; the different link widths are referred to as 1X, 4X, and 12X. The basic 1X copper link has four wires, comprising a differential signaling pair for each direction. Similarly, the 1X fibre link has two optical fibres, one for each direction. Wider widths increase the number of signal paths accordingly. There is also a copper backplane connection allowing dense structures of modules to be constructed. The 1X size allows up to six ports on the faceplate of the standard (smallest) size IBA module. Short-reach (multimode) optical fibre links are provided in all three widths; while distances are not specified, they are expected to reach 250 m for 1X and 125 m for 4X and 12X. Long-reach (single-mode) fibre is defined in the 1.0 IBA specification only for the 1X width, with an anticipated reach of up to 10 km.

5.2 Endnodes

IBA endnodes are the ultimate sources and sinks of communication in IBA. They may be host systems or devices (network adapters, storage subsystems, and so on). It is also possible that endnodes will be developed that are bridges to legacy I/O buses such as PCI, but whether and how that is done is vendor-specific; it is not part of the InfiniBand architecture. Note that, as a communication service, IBA makes no distinction between these types; an endnode is simply an endnode. So all IBA facilities may be used equally to communicate between hosts and devices; between hosts and other hosts, like normal networking; or even directly between devices, for example direct disk-to-tape backup without any load imposed on a host. IBA defines several standard form factors for devices used as endnodes: standard, wide, tall, and tall wide. The standard form factor is approximately 20 x 100 x 220 mm; wide doubles the width and tall doubles the height.


Figure 6: Flow control in InfiniBand

6 FLOW CONTROL

InfiniBand defines two levels of credit-based flow control to manage congestion: link level and end-to-end. Link-level flow control applies back pressure to traffic on a link, while end-to-end flow control protects against buffer overflow at endpoint connections that might be multiple hops away. Each receiving end of a link or connection supplies credits to the sending device to specify the amount of data that it can reliably receive. Sending devices do not transmit data unless the receiver advertises credits indicating available receive buffer space. The link and connection protocols have built-in credit passing between each device to guarantee reliable flow-control operation. InfiniBand handles link-level flow control on a per-quality-of-service-level (virtual lane) basis.

InfiniBand has a unidirectional 2.5-Gbps wire-speed connection (250 MB/s, using the 8B/10B encoding of 10 bits per data byte, similar to 3GIO) and uses either one differential signal pair per direction, called 1X, or 4 (4X) or 12 (12X) pairs for bandwidth of up to 30 Gbps per direction (12 x 2.5 Gbps). Bidirectional throughput with InfiniBand is often expressed in MB/s, yielding 500 MB/s for 1X, 2 GB/s for 4X, and 6 GB/s for 12X. Each bidirectional 1X connection consists of four wires, two for send and two for receive. Both fiber and copper are supported; copper can be in the form of traces or cables, and fiber distances between nodes can be as far as 300 meters and more. Each InfiniBand subnet can host up to 64,000 nodes.
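As a rough illustration of the credit-based scheme just described, the following Python sketch models a sender that transmits on a virtual lane only when the receiver has advertised enough buffer credits. The class name, the per-VL credit counters, and the credit units are assumptions made for this example, not definitions from the InfiniBand specification.

```python
# Illustrative sketch of credit-based, per-virtual-lane flow control.
# Names and credit units are assumptions for this example, not taken
# from the InfiniBand specification.

class VirtualLaneLink:
    def __init__(self, num_vls=8, buffer_credits=16):
        # Each VL gets its own credit counter (receive buffer space).
        self.credits = {vl: buffer_credits for vl in range(num_vls)}

    def advertise_credits(self, vl, count):
        """Receiver grants more credits when buffer space frees up."""
        self.credits[vl] += count

    def try_send(self, vl, packet_credits):
        """Sender transmits only if the receiver advertised enough credits."""
        if self.credits[vl] >= packet_credits:
            self.credits[vl] -= packet_credits
            return True          # packet goes on the wire
        return False             # back pressure: hold the packet


link = VirtualLaneLink()
assert link.try_send(vl=0, packet_credits=4)      # enough credits, sent
for _ in range(3):
    link.try_send(vl=0, packet_credits=4)         # drain remaining credits
assert not link.try_send(vl=0, packet_credits=4)  # blocked until...
link.advertise_credits(vl=0, count=4)             # ...receiver frees buffers
assert link.try_send(vl=0, packet_credits=4)
```

Because the counters are kept per virtual lane, congestion on one lane applies back pressure only to that lane, which is exactly why link-level flow control is handled on a per-VL basis.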


7 INFINIBAND SUBNET MANAGEMENT AND QoS
InfiniBand supports two levels of management packets: subnet management and the general services interface (GSI). High-priority subnet management packets (SMPs) are used to discover the topology of the network, the attached nodes, and so on, and are transported within the high-priority VLane, which is not subject to flow control. The low-priority GSI management packets handle management functions such as chassis management and other functions not associated with subnet management. Because these services are not critical to subnet management, GSI management packets are not transported within the high-priority VLane and are subject to flow control.

InfiniBand supports quality of service at the link level through virtual lanes. An InfiniBand virtual lane is a separate logical communication link that shares a single physical link with other virtual lanes. Each virtual lane has its own buffer and flow-control mechanism implemented at each port in a switch. InfiniBand allows up to 15 general-purpose virtual lanes plus one additional lane dedicated to management traffic. Link-layer quality of service comes from isolating traffic congestion to individual virtual lanes. For example, the link layer can isolate isochronous real-time traffic from non-real-time data traffic; that is, isolate real-time voice or multimedia streams from Web or FTP data traffic. The system manager can assign a higher virtual-lane priority to voice traffic, in effect scheduling voice packets ahead of congested data packets in each link buffer encountered on the voice packets' end-to-end path. Thus, the voice traffic will still move through the fabric with minimal latency.

InfiniBand presents a number of transport services that provide different characteristics. To ensure reliable, sequenced packet delivery, InfiniBand uses flow control and service levels in conjunction with VLanes to achieve end-to-end QoS. InfiniBand VLanes are logical channels that share a common physical link, where VLane 15 has the highest priority and is used exclusively for management traffic, and VLane 0 has the lowest. The concept of a VLane is similar to that of the hardware queues found in routers and switches. For applications that require reliable delivery, InfiniBand supports reliable delivery of packets using flow control. Within an InfiniBand network, the receivers on a point-to-point link periodically transmit information to the upstream transmitter to specify the amount of data that can be transmitted without data loss, on a per-VLane basis. The transmitter can then transmit data up to the amount of credits advertised by the receiver; if no buffer credits exist, data cannot be transmitted. The use of credit-based flow control prevents packet loss that might result from congestion, and it enhances application performance because it avoids packet retransmission. For applications that do not require reliable delivery, InfiniBand also supports unreliable delivery of packets (i.e., they may be dropped with little or no consequence) that are not subject to flow control; some management traffic, for example, does not require reliable delivery.

At the InfiniBand network layer, the GRH contains an 8-bit traffic class field. This value is mapped to a 4-bit service level field within the LRH to indicate the service level that the packet is requesting from the InfiniBand network. The HCA matches the packet's service level against a service-level-to-VLane table, which has been populated by the subnet manager, and then transmits the packet on the VLane associated with that service level. As the packet traverses the network, each switch matches the service level against the table at the packet's egress port to identify the VLane within which the packet should be transported.
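The service-level-to-VLane lookup described above can be pictured with a short, hypothetical sketch; the table contents below are invented for illustration, whereas in a real fabric the SL-to-VL tables are programmed by the subnet manager.

```python
# Hypothetical sketch of the SL-to-VL lookup a port performs before
# transmitting a packet. The mapping values are invented for illustration;
# the subnet manager programs the real SLtoVL Mapping Tables.

SL_TO_VL = {0: 0, 1: 0, 4: 2, 8: 5}   # service level -> data virtual lane

def select_vlane(service_level):
    """Return the data VL for a service level, with a default fallback lane."""
    return SL_TO_VL.get(service_level, 0)

print(select_vlane(8))   # a real-time class might map to its own VL
print(select_vlane(3))   # an unmapped SL falls back to the default data lane
```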


Figure 7: RDMA Hardware

8 REMOTE DIRECT MEMORY ACCESS (RDMA)

One of the key problems with server I/O is the CPU overhead associated with data movement between memory and I/O devices such as LAN and SAN interfaces. InfiniBand addresses this problem by using RDMA to offload data movement from the server CPU to the InfiniBand host channel adapter (HCA). RDMA is an extension of hardware-based Direct Memory Access (DMA), which allows the CPU to delegate data movement within the computer to the DMA hardware: the CPU specifies the memory location where the data associated with a particular process resides and the memory location the data is to be moved to, and once the DMA instructions are issued, the CPU can process other threads while the DMA hardware moves the data. RDMA extends this capability by enabling data to be moved from one memory location to another even if that memory resides on another device.

8.1 Comparing a Traditional Server I/O and RDMA-Enabled I/O
The process in a traditional server I/O is extremely inefficient because it results in multiple copies of the same data traversing the memory bus, and it invokes multiple CPU interrupts and context switches.

Figure 8: Traditional Server I/O

By contrast, RDMA, an embedded hardware function of the InfiniBand HCA, handles all communication operations without interrupting the CPU. Using RDMA, the sending device either reads data from or writes data directly to the target device's user-space memory, thereby avoiding CPU interrupts and multiple data copies on the memory bus, which enables RDMA to significantly reduce CPU overhead.
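A hypothetical sketch of the information an RDMA write carries may help make the contrast concrete: the CPU posts a single descriptor and the HCA moves the data, so no intermediate copies or per-packet interrupts are needed. The field and function names below are invented for illustration and do not correspond to any real verbs API.

```python
# Hypothetical sketch of an RDMA write descriptor: the CPU posts it once and
# the HCA moves the data, so no intermediate copies or per-packet interrupts
# are needed. Field and function names are invented for illustration only.

from dataclasses import dataclass

@dataclass
class RdmaWriteRequest:
    local_addr: int      # where the payload already sits in local memory
    remote_addr: int     # target virtual address in the peer's registered memory
    remote_key: int      # access key the peer handed out when registering memory
    length: int          # bytes to transfer

def post_rdma_write(send_queue, request):
    """The CPU's only work: enqueue the descriptor. The HCA does the rest."""
    send_queue.append(request)

send_queue = []
post_rdma_write(send_queue, RdmaWriteRequest(0x1000, 0x9000, remote_key=42, length=4096))
print(f"{len(send_queue)} descriptor posted; CPU is free to run other threads")
```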


Figure 9: RDMA-Enabled Server I/O


Figure 10: InfiniBand Protocol Stack

9 INFINIBAND PROTOCOL STACK

From a protocol perspective, the InfiniBand architecture consists of four layers: physical, link, network, and transport. These layers are analogous to Layers 1 through 4 of the OSI protocol stack. InfiniBand is divided into multiple layers, each of which operates independently of the others.

9.1 Physical Layer

InfiniBand is a comprehensive architecture that defines both electrical and mechanical characteristics for the system, including cables, receptacles, and copper media; backplane connectors; and hot-swap characteristics. InfiniBand defines three link speeds at the physical layer: 1X, 4X, and 12X. Each individual link is a four-wire serial connection (two wires in each direction) that provides a full-duplex connection at 2.5 Gbps. The physical layer thus specifies the hardware components.

9.2 Link Layer

The link layer, along with the transport layer, is the heart of the InfiniBand architecture. It encompasses packet layout, point-to-point link operations, and switching within a subnet. At the packet communication level, two packet types are specified, for data transfer and for network management. The management packets provide operational control over device enumeration, subnet directing, and fault tolerance. Data packets transfer the actual information, with each packet carrying a maximum of four kilobytes of transaction information. Within each subnet, packet direction and switching properties are directed by a subnet manager using 16-bit local identification addresses.

The link layer also provides the Quality of Service characteristics of InfiniBand. The primary consideration is the use of the virtual lane (VL) architecture for interconnectivity. Even though a single IBA data path may be defined at the hardware level, the VL approach allows for 16 logical links. With 15 independent levels (VL0 to VL14) and one management path (VL15) available, device-specific prioritization can be configured. Since management requires the highest priority, VL15 retains the maximum priority. The ability to assert a priority-driven architecture contributes not only to Quality of Service but to performance as well. Credit-based flow control is also used to manage data flow between two point-to-point links; flow control is handled on a per-VL basis, allowing separate virtual fabrics to maintain communication while utilizing the same physical media.

9.3 Network Layer

The network layer handles routing of packets from one subnet to another; within a subnet, the network layer is not required. Packets sent between subnets contain a Global Route Header (GRH), which holds the 128-bit IPv6 addresses of the source and destination of the packet. The packets are forwarded between subnets through routers based on each device's 64-bit globally unique ID (GUID), and each router modifies the LRH with the proper local address within each subnet; the last router in the path therefore replaces the LID in the LRH with the LID of the destination port. InfiniBand packets do not require the network layer information and header overhead when used within a single subnet, which is a likely scenario for InfiniBand system area networks.

9.4 Transport Layer

The transport layer is responsible for in-order packet delivery, partitioning, channel multiplexing, and the transport services (reliable connection, reliable datagram, unreliable datagram). It also handles segmentation of transaction data when sending and reassembly when receiving: based on the maximum transfer unit (MTU) of the path, the transport layer divides the data into packets of the proper size. The receiver reassembles the packets based on the Base Transport Header (BTH), which contains the destination queue pair and the packet sequence number. The receiver acknowledges the packets, and the sender receives these acknowledgments and updates the completion queue with the status of the operation. IBA offers a significant improvement for the transport layer: all functions are implemented in hardware. InfiniBand specifies multiple transport services for data reliability.
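A minimal sketch of the segmentation and reassembly step follows, assuming only a packet sequence number per packet; the real Base Transport Header also carries the destination queue pair and other fields.

```python
# Minimal sketch of transport-layer segmentation and reassembly based on a
# path MTU. Only a packet sequence number is modelled here; the real Base
# Transport Header also carries the destination queue pair and other fields.

def segment(message: bytes, mtu: int):
    """Split a message into MTU-sized packets tagged with sequence numbers."""
    return [(psn, message[i:i + mtu])
            for psn, i in enumerate(range(0, len(message), mtu))]

def reassemble(packets):
    """Receiver side: order by sequence number and concatenate the payloads."""
    return b"".join(payload for _, payload in sorted(packets))

msg = bytes(5000)                          # a 5000-byte message
pkts = segment(msg, mtu=2048)              # 2-Kbyte MTU -> 3 packets
assert reassemble(reversed(pkts)) == msg   # reassembly tolerates reordering
```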

10 COMMUNICATION SERVICES

IBA provides several different types of communication services between endnodes:

Reliable Connection (RC): a connection is established between endnodes, and messages are reliably sent between them. This is optional for TCAs (devices) but mandatory for HCAs (hosts).

Unreliable Datagram (UD): a single-packet message can be sent to an endnode without first establishing a connection; transmission is not guaranteed.

Unreliable Connection (UC): a connection is established between endnodes and messages are sent, but transmission is not guaranteed. This is optional.

Reliable Datagram (RD): a single-packet message can be reliably sent to any endnode without a one-to-one connection. This is optional.

Raw IPv6 Datagram and Raw EtherType Datagram (Raw, optional): single-packet unreliable datagram services with all but local transport header information stripped off; this allows packets using non-IBA transport layers to traverse an IBA network, for example for use by routers and network interfaces to transfer packets to other media with minimal modification.

Here, "reliably sent" means that, barring catastrophic failure, the data is guaranteed to arrive in order, checked for correctness, with its receipt acknowledged. Each packet, even those for unreliable datagrams, contains two separate CRCs: one covering data that cannot change (the invariant ICRC) and one that must be recomputed (the variant VCRC) because it covers data that can change; such change can occur only when a packet moves from one IBA subnet to another. Many of these features resemble those of the Virtual Interface Architecture (VIA), discussed below; this is intentional, since they provide essentially the same services. However, IBA's services are designed for hardware implementation, as required by a high-performance I/O system. In addition, the host-side functions have been designed to allow all service types to be used completely in user mode, without necessarily using any operating system services, with RDMA moving data directly into or out of the memory of an endnode. This user-mode operation implies that virtual addressing must be supported by the channel adapters, since real addresses are unavailable in user mode. In addition to RDMA, the reliable communication classes also optionally support atomic operations directly against an endnode's memory. The atomic operations supported are Fetch-and-Add and Compare-and-Swap, both on 64-bit data; atomics are effectively a variation on RDMA, a combined write and read RDMA carrying the data.
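The service classes above differ mainly in whether a connection is established first and whether delivery is acknowledged; the small lookup below restates that summary as a sketch (the boolean encoding is just an illustration).

```python
# Summary of the IBA transport service classes described above, as a small
# lookup: (connection-oriented?, reliable/acknowledged?).

SERVICE_CLASSES = {
    "RC":  (True,  True),    # Reliable Connection (mandatory for HCAs)
    "UC":  (True,  False),   # Unreliable Connection (optional)
    "RD":  (False, True),    # Reliable Datagram (optional)
    "UD":  (False, False),   # Unreliable Datagram
    "Raw": (False, False),   # Raw datagrams for non-IBA transports (optional)
}

for name, (connected, reliable) in SERVICE_CLASSES.items():
    print(f"{name:3s}  connection-oriented={connected}  reliable={reliable}")
```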

10.1 Communication Stack: InfiniBand Support for the Virtual Interface Architecture (VIA)
The Virtual Interface Architecture is a distributed messaging technology that is both hardware-independent and compatible with current network interconnects. The architecture provides an API that can be used to provide high-speed, low-latency communication between peers in clustered applications. InfiniBand was developed with the VIA architecture in mind. InfiniBand offloads traffic control from the software client through the use of execution queues. These queues, called work queues, are initiated by the client and then left for InfiniBand to manage. For each communication channel between devices, a work queue pair (WQP, a send queue and a receive queue) is assigned at each end. The client places a transaction into the work queue (a work queue entry, WQE), which is then processed by the channel adapter and sent out to the remote device. When the remote device responds, the channel adapter returns status to the client through a completion queue or event. The client can post multiple WQEs, and the channel adapter's hardware handles each of the communication requests. The channel adapter then generates a completion queue entry (CQE) to provide status for each WQE in the proper prioritized order. This allows the client to continue with other activities while the transactions are being processed.
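The work-queue flow just described can be sketched as follows; the classes and dictionaries are simplified stand-ins for the real queue pair, WQE, and CQE structures.

```python
# Simplified sketch of the work-queue model: the client posts work queue
# entries (WQEs), the channel adapter processes them asynchronously, and a
# completion queue entry (CQE) reports status for each one.

from collections import deque

class QueuePairSketch:
    def __init__(self):
        self.send_queue = deque()       # WQEs waiting for the channel adapter
        self.completion_queue = deque()

    def post_send(self, wqe):
        """Client side: hand the transaction to the adapter and keep working."""
        self.send_queue.append(wqe)

    def adapter_process_one(self):
        """Stand-in for the channel adapter hardware draining the send queue."""
        if self.send_queue:
            wqe = self.send_queue.popleft()
            self.completion_queue.append({"wqe": wqe, "status": "success"})

    def poll_completion(self):
        """Client polls the CQ instead of being interrupted per transaction."""
        return self.completion_queue.popleft() if self.completion_queue else None


qp = QueuePairSketch()
qp.post_send({"op": "send", "length": 1024})
qp.adapter_process_one()
print(qp.poll_completion())    # {'wqe': {...}, 'status': 'success'}
```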


Figure 11: InfiniBand Protocol Stack

11 INFINIBAND FABRIC VERSUS SHARED BUS
The switched fabric architecture of InfiniBand is designed around a completely different approach compared to the limited capabilities of the shared bus. IBA specifies a point-to-point (PTP) communication protocol for primary connectivity. Being based upon PTP, each link along the fabric terminates at exactly one connection point (or device). The underlying transport addressing standard is derived from the IP method employed by advanced networks: each InfiniBand device is assigned an IP address, so the load management and signal termination characteristics are clearly defined and more efficient. To add more TCA connection points or endnodes, the simple addition of a dedicated IBA switch is all that is required. Unlike the shared bus, each TCA and IBA switch can be interconnected via multiple data paths in order to sustain maximum aggregate device bandwidth and to provide fault tolerance by way of multiple redundant connections.

12 INFINIBRIDGE

InfiniBridge is effective for implementing HCAs, TCAs, or standalone switches with very few external components. The device's channel adapter side has a standard 64-bit-wide PCI interface operating at 66 MHz that enables operation with a variety of standard I/O controllers, motherboards, and backplanes. The device's InfiniBand side is an advanced switch architecture that is configurable as eight 1X ports, two 4X ports, or a mix of each. Industry-standard external serializers/deserializers interface the switch ports to InfiniBand-supported media (printed circuit board traces, copper cable connectors, or fiber transceiver modules). No external memory is required for switching or channel adapter functions. The embedded processor initializes the IC on reset and executes subnet management agent functions in firmware; an I2C EPROM holds the boot configuration.

InfiniBridge also effectively implements managed or unmanaged switch applications. The PCI or CPU interface can connect external controllers running InfiniBand management software, or an unmanaged switch design can eliminate the processor connection for applications with low area and part count. Appropriate configuration of the ports can implement a 4X-to-four-1X aggregation switch. The InfiniBridge switching architecture implements these advanced features of the InfiniBand architecture: standard InfiniBand packets up to an MTU size of 4 Kbytes, eight virtual lanes and one management lane, 16K unicast local identifiers (LIDs), 1K multicast LIDs, VCRC and ICRC integrity checks, and 4X-to-1X link aggregation.

12.1 Hardware transport performance of InfiniBridge
Hardware transport is probably the most significant feature InfiniBand offers to next-generation data center and telecommunications equipment. Hardware transport performance is primarily a measurement of CPU utilization during a period of a device's maximum wire-speed throughput; the lowest CPU utilization is desired. The following test setup was used to evaluate InfiniBridge hardware transport: two 800-MHz PIII servers with InfiniBridge 64-bit/66-MHz PCI channel adapter cards running Red Hat Linux 7.1, a 1X InfiniBand link between the two server channel adapters, an InfiniBand protocol analyzer inserted in the link, and an embedded storage protocol running over the link. The achieved wire speed was 1.89 Gbps in both directions simultaneously, which is 94 percent of the maximum possible bandwidth of a 1X link (2.5 Gbps minus 8B/10B encoding overhead, or 2 Gbps). During this time, the driver used an average of 6.9 percent of the CPU. The bidirectional traffic also traverses the PCI bus, which has a unidirectional upper limit of 4.224 Gbps; although the InfiniBridge DMA engine can efficiently send burst packet data across the PCI bus, PCI is likely the limiting factor in this test case.
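The quoted figures follow from the 8B/10B encoding overhead; a few lines of arithmetic reproduce them.

```python
# Reproducing the arithmetic behind the quoted figures: a 1X link signals at
# 2.5 Gbaud, 8B/10B encoding leaves 2.0 Gbps of data bandwidth, and the
# measured 1.89 Gbps is about 94 percent of that ceiling.

signalling_rate_gbps = 2.5
data_rate_gbps = signalling_rate_gbps * 8 / 10     # 2.0 Gbps after 8B/10B
measured_gbps = 1.89

print(f"1X data rate: {data_rate_gbps:.2f} Gbps")
print(f"link utilisation: {measured_gbps / data_rate_gbps:.0%}")   # ~94%
```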


13 INFINIBRIDGE CHANNEL ADAPTER ARCHITECTURE
The InfiniBridge channel adapter architecture has two blocks, each having independent ports to the switch fabric, as the figure shows. One block uses a direct memory access (DMA) engine interface to the PCI bus, and the other uses PCI target and PCI master interfaces. This provides flexibility in the use of the PCI bus and enables implementation of the InfiniPCI feature. This unique feature lets the transport hardware automatically translate PCI transactions to InfiniBand packets, thus enabling transparent PCI-to-PCI bridging over the InfiniBand fabric. Both blocks include hardware transport engines that implement the InfiniBand features of reliable connection, unreliable datagram, raw datagram, RDMA reads/writes, message sizes up to 2 Kbytes, and eight virtual lanes.

The PCI target includes address base/limit hardware to claim PCI transactions in segments of the PCI address space. Each segment can be associated with a standard InfiniBand channel in the PCI-target transport engine, and this association lets claimed transactions be translated into InfiniBand packets that go out over the corresponding channel. In the reverse direction, the PCI master also has segment hardware that lets a channel automatically translate InfiniBand packet payloads into PCI transactions generated onto the PCI bus. This flexible segment capability and channel association enables the construction of transparent PCI bridges over the InfiniBand fabric.

The DMA interface can move data directly between local memory and InfiniBand channels. This process uses execution queues containing linked lists of descriptors that one of multiple DMA execution engines will execute. Each descriptor can contain a multi-entry scatter-gather list, and each engine can use this list to gather data from multiple locations in local memory and combine it into a single message to send into an InfiniBand channel. Similarly, the engines can scatter data received from an InfiniBand channel to local memory.


Figure 13: InfiniBridge Channel Adapter Architecture

14 VIRTUAL OUTPUT QUEUEING ARCHITECTURE
InfiniBridge uses an advanced virtual output queuing (VOQ) and cut-through switching architecture to implement these features with low latency and non-blocking performance. Each port has a VOQ buffer, transmit scheduling logic, and packet decoding logic. Incoming data goes to both the VOQ buffer and the packet decoding logic. The decoder extracts the parameters needed for flow control, scheduling, and forwarding decisions. Processing of the flow-control inputs gives link flow-control credits to the local transmit port, limiting output packets based on available credits. InfiniBridge decodes the destination local identification from the packet and uses it to index the forwarding database and retrieve the destination port number; the switch fabric uses the destination port number to decide which port to send the scheduling information to. The service level identification field is also extracted from the input packet by the decoder and used to determine the virtual lane, which goes to the destination port's transmit scheduling logic. All parameter decoding takes place in real time and is given to the switch fabric to make scheduling requests as soon as the information is available.


Figure 14: Virtual output-queuing architecture
The packet data is stored only once, in the VOQ. The transmit scheduling logic of each port arbitrates the order of output packets and pulls them from the correct VOQ buffer. Each port logic module is actually part of a distributed scheduling architecture that maintains the status of all output ports and receives all scheduling requests. In cut-through mode, a port scheduler receives notification of an incoming packet as soon as the local identification for that packet's destination is decoded. Once the port scheduler receives the virtual lane and other scheduling information, it schedules the packet for output; transmission can start immediately, based on the priority of waiting packets and the flow-control credits for the packet's virtual lane. The switch fabric actually includes three on-chip ports in addition to the eight external ones, as the figure shows: one port is a management port that connects to the internal RISC processor, which handles management packets and exceptions, and the other two ports interface with the channel adapter.
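A very reduced sketch of the virtual output queuing idea follows: each input port keeps a separate queue per destination output port, so a congested output does not block packets headed elsewhere (no head-of-line blocking). The class and method names are illustrative, not taken from the InfiniBridge design.

```python
# Reduced sketch of virtual output queuing (VOQ): each input port keeps one
# queue per destination output port, so congestion at one output does not
# block packets bound for a different output (no head-of-line blocking).

from collections import deque

class InputPortVOQ:
    def __init__(self, num_output_ports):
        self.voq = [deque() for _ in range(num_output_ports)]

    def enqueue(self, packet, dest_port):
        # The decoder has already looked up dest_port from the packet's LID.
        self.voq[dest_port].append(packet)

    def pull_for_output(self, output_port):
        """The output port's scheduler pulls only from its own queue."""
        q = self.voq[output_port]
        return q.popleft() if q else None


port = InputPortVOQ(num_output_ports=8)
port.enqueue("pkt-A", dest_port=3)
port.enqueue("pkt-B", dest_port=5)
# Output 5 can transmit pkt-B even if output 3 is congested and never drains.
print(port.pull_for_output(5))   # pkt-B
```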

15 FORMAL MODEL TO MANAGE INFINIBAND ARBITRATION TABLES TO PROVIDE QUALITY OF SERVICE (QoS)
The InfiniBand Architecture (IBA) has been proposed as an industry standard both for communication between processing nodes and I/O devices and for interprocessor communication. It replaces the traditional bus-based interconnect with a switch-based network for connecting processing nodes and I/O devices. It is being developed by the InfiniBand Trade Association (IBTA) with the aim of providing the levels of reliability, availability, performance, scalability, and quality of service (QoS) required by present and future server systems. For this purpose, IBA provides a series of mechanisms that are able to guarantee QoS to applications. It is therefore important for InfiniBand to be able to satisfy both applications that only need minimum latency and applications that need other characteristics to meet their QoS requirements. These mechanisms are mainly the segregation of traffic according to categories and the arbitration of output ports according to an arbitration table that can be configured to give priority to packets with higher QoS requirements.

15.1 THREE MECHANISMS TO PROVIDE QoS

Basically, IBA has three mechanisms to support QoS: service levels, virtual lanes, and virtual lane arbitration.

15.1.1 Service Level

IBA defines service levels as the means of classifying traffic: each packet is marked with a service level (SL) that reflects its QoS requirements, and this SL is used at each link to select the virtual lane on which the packet travels, so that different classes of traffic can receive different treatment across the fabric.

15.1.2 Virtual Lanes

IBA ports support virtual lanes (VLs), providing a mechanism for creating multiple virtual links within a single physical link. A VL is an independent set of receiving and transmitting buffers associated with a port, and each VL must be an independent resource for flow-control purposes. IBA ports have to support a minimum of two and a maximum of 16 virtual lanes (VL0 through VL15). All ports support VL15, which is reserved exclusively for subnet management and must always have priority over data traffic in the other VLs. Since systems can be constructed with switches supporting different numbers of VLs, the number of VLs used by a port is configured by the subnet manager. Also, packets are marked with a service level (SL), and a relation between SL and VL is established at the input of each link with the SLtoVL Mapping Table. When more than two VLs are implemented, an arbitration mechanism is used to allow an output port to select which virtual lane to transmit from. This arbitration applies only to data VLs, because VL15, which transports control traffic, always has priority over any other VL. The priorities of the data lanes are defined by the VL Arbitration Table.

15.1.3 Virtual Arbitration Table

When more than two VLs are implemented, the VL Arbitration Table defines the priorities of the data lanes. Each VL Arbitration Table consists of two tables: one for delivering packets from high-priority VLs and another for low-priority VLs. Up to 64 table entries are cycled through, each one specifying a VL and a weight; the weight is the number of units of 64 bytes to be sent from that VL, must be in the range 0 to 255, and is always rounded up in order to transmit a whole packet. In addition, a Limit of High Priority value specifies the maximum number of high-priority packets that can be sent before a low-priority packet is sent; more specifically, the VLs of the high-priority table can transmit Limit of High Priority x 4,096 bytes before a packet from the low-priority table can be transmitted. If no high-priority packets are ready for transmission at a given time, low-priority packets can also be transmitted.
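The two-level weighted arbitration can be sketched as follows: high-priority entries are serviced until the Limit of High Priority budget (in units of 4,096 bytes) is spent, after which the low-priority table gets the link. The data structures and the single-pass simplification below are assumptions for illustration only.

```python
# Simplified sketch of VL arbitration: service the high-priority table until
# the Limit-of-High-Priority budget (limit x 4096 bytes) is consumed, then
# let the low-priority table transmit. Each entry is (VL, weight), with the
# weight counted in units of 64 bytes, as in the specification.

def arbitrate(high_table, low_table, limit_high_priority):
    budget = limit_high_priority * 4096          # bytes of high-priority credit
    schedule = []
    for vl, weight in high_table:                # one pass stands in for cycling
        bytes_for_entry = min(weight * 64, budget)
        if bytes_for_entry > 0:
            schedule.append(("high", vl, bytes_for_entry))
            budget -= bytes_for_entry
    for vl, weight in low_table:                 # low table gets the link next
        schedule.append(("low", vl, weight * 64))
    return schedule

high = [(1, 255), (2, 128)]     # (virtual lane, weight in 64-byte units)
low = [(3, 64), (4, 32)]
for entry in arbitrate(high, low, limit_high_priority=4):
    print(entry)
```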


Figure 15: Virtual Lanes

16 FORMAL MODEL FOR THE INFINIBAND ARBITRATION TABLE
This section presents a formal model to manage the IBA arbitration table, together with several algorithms that adapt the model for use in a dynamic scenario in which new requests and releases are made, and a concrete algorithm to find a sequence of free entries able to accommodate a connection request in the table. The treatment of the problem basically consists of setting out an efficient algorithm able to select a sequence of free entries in the arbitration table; these entries must be selected with a maximum separation between any consecutive pair. To develop this algorithm, we first propose some hypotheses and definitions to establish the correct framework, and later present the algorithm and its associated theorems.

Figure 16: Virtual Arbitration Table

We also consider some specific characteristics of IBA: the number of table entries (64) and the range of the weight values (0 to 255). All we need to know is that requests are originated by connections so that certain requirements are guaranteed; the group of entries assigned to a request belongs to the arbitration table associated with an output port of an InfiniBand switch or an interface of a host. We formally define the following concepts:

Table: a circular list of 64 entries.

Entry: each one of the 64 parts composing a table.

Weight: the numerical value of an entry in the table, which can vary between 0 and 255.

Status of an entry: the situation of an entry of the table; it is either free (weight 0) or occupied (weight greater than 0).

Request: a demand for a certain number of entries.

Distance: the maximum separation between two consecutive entries in the table that are assigned to one request.

Type of request: each one of the different types into which requests can be grouped, based on the requested distance and, therefore, on the requested number of entries.

Group or sequence of entries: a set of entries of the table with a fixed distance between any consecutive pair. To characterize a sequence of entries, it is enough to give the first entry and the distance between consecutive entries.

16.0.4 Initial Hypothesis

In what follows, and when not indicated to the contrary, the following hypotheses will be considered:

1. There are no request eliminations, so the table is filled in as new requests are received and these requests are never removed. In other words, entries can change from free to occupied, but it is not possible for an occupied entry to change to free. This hypothesis permits a simpler and clearer initial study, but it will logically be discarded later on.

2. It may be necessary to devote more than one group of entries to a set of requests of the same type.

3. The total weight associated with one request is distributed among the entries of the selected sequence so that the weight of the first entry of the sequence is always larger than or equal to the weight of the other entries of the sequence.

4. The distance d associated with one request will always be a power of 2, and it must be between 1 and 64. These distances define the different types of requests that we are going to consider.

Figure 17: Structure of a VL Arbitration Table


17 FILLING IN THE VL ARBITRATION TABLE
The classification of traffic into categories based on QoS requirements is just a first step toward providing QoS; a suitable filling in of the arbitration table is also critical. We propose a strategy to fill in the weights of the arbitration tables; this section shows how to fill in the table in order to provide the bandwidth requested by each application and also to provide latency guarantees. Each arbitration table has only 64 entries, so if a different entry were devoted to each connection, the number of connections that could be accepted would be limited; moreover, a connection requiring very high bandwidth could need slots in more than one entry of the table. For these reasons, we propose grouping connections with the same SL into a single entry of the table, until the maximum weight for that entry is reached, before moving to another free entry. In this way, the number of entries in the table is not a limitation on the acceptance of new connections; only the available bandwidth is.

For a new request of maximum distance d = 2^i, each candidate set contains the entries needed to meet a request of that distance, and the first of these sets having all of its entries free is selected. The order in which the sets are examined aims to maximize the distance between consecutive free entries that remain in the table after the selection; this way, the table remains in the optimum condition to later meet the most restrictive possible request. The sketch below illustrates the basic selection step.
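The following sketch assumes entries with weight 0 are free and that a request of distance d (a power of two) needs the 64/d entries spaced d apart around the circular table; the ordering heuristics that maximise the remaining separation between free entries are not reproduced here.

```python
# Sketch of the selection step: find a starting offset such that every entry
# at distance d (a power of two) around the 64-entry circular table is free.
# Free is modelled as weight 0, as in the formal model. This only finds a
# candidate sequence; it does not reproduce the model's ordering heuristics.

TABLE_SIZE = 64

def find_free_sequence(weights, d):
    """Return the entry indices of a free sequence with distance d, or None."""
    assert d in (1, 2, 4, 8, 16, 32, 64)
    for start in range(d):                         # candidate starting offsets
        entries = list(range(start, TABLE_SIZE, d))
        if all(weights[e] == 0 for e in entries):  # weight 0 means free
            return entries
    return None

table = [0] * TABLE_SIZE
table[0] = 100                                     # entry 0 already occupied
print(find_free_sequence(table, d=16))             # e.g. [1, 17, 33, 49]
```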

17.1 Insertion and elimination in the table

The elimination of requests is now possible; as a consequence, the entries used by an eliminated request are released. Under the filling-in algorithm, the released entries may leave free entries that are not correctly separated, so the table may need to be rearranged before the freed space can be reused for new requests.

17.1.1 Example 1

Suppose the table is full and two requests of type d = 8 are eliminated. These requests were using the entries of the sets specified in the tree. This means that the table now has free entries available for new requests.

17.2 Defragmentation Algorithm

The basic idea of this algorithm is to group all of the free entries of the table into a few free sets that permit meeting any request needing a number of entries equal to or lower than the number of free table entries. The objective of the algorithm is therefore to perform a grouping of the free entries: a process that consists of joining the entries of two free sets of the same size into a single free set. This joining is performed only if the two free sets do not already belong to the same larger free set, so the algorithm is restricted to singular sets. The goal is to obtain a free set of the biggest possible size in order to meet a request of that size, for the case in which the table has enough free entries but they belong to two smaller free sets that individually cannot meet the request. A sketch of the joining step follows.
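The sketch below represents a free set as (first entry, distance), as in the definitions above; the condition used to decide whether two equal-sized sets can be merged into one set at half the distance is a simplification of the full algorithm.

```python
# Sketch of the joining step in the defragmentation idea: two free sets of
# the same size (same distance 2d), whose starting entries differ by d, can
# be merged into one larger free set of distance d. Sets are represented as
# (first_entry, distance); the merging condition here is a simplification.

def try_join(set_a, set_b):
    start_a, dist_a = set_a
    start_b, dist_b = set_b
    if dist_a != dist_b or dist_a < 2:
        return None                        # only equal-sized sets are joined
    d = dist_a // 2
    if abs(start_a - start_b) == d:        # the two sets interleave exactly
        return (min(start_a, start_b), d)  # one free set, twice the entries
    return None

# Two free sets of 4 entries each (distance 16) merge into one of 8 entries.
print(try_join((1, 16), (9, 16)))          # -> (1, 8)
print(try_join((1, 16), (5, 16)))          # -> None (they do not interleave)
```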

17.3 Reordering Algorithm

The reordering algorithm is basically an ordering algorithm applied at the level of sets. It has been designed to be applied to a table that is not ordered, with the purpose of leaving the table ordered, since an ordered table ensures that requests are properly served.

17.4 Global management of the table

For the global management of the table, with both insertions and releases, a combination of the filling-in and defragmentation algorithms (and even the reordering algorithm, if needed) must be used. This global management guarantees that the table always remains in a correct status, so that the propositions of the filling-in algorithm continue to hold; in this way, the overall management of the arbitration table is achieved.

18 CONCLUSION

InfiniBand is a powerful new architecture designed to support I/O connectivity for the Internet infrastructure. It is supported by all major OEM server vendors as a means to expand and create the next-generation I/O interconnect standard in servers. IBA enables Quality of Service (QoS) through certain mechanisms, basically service levels, virtual lanes, and table-based arbitration of virtual lanes. A formal model manages the InfiniBand arbitration tables to provide QoS; according to this model, each application needs a sequence of entries in the IBA arbitration tables based on its requirements, which are related to the mean bandwidth needed and the maximum latency tolerated by the application. InfiniBand provides a comprehensive silicon, software, and system solution with a layered protocol and a management infrastructure.

Mellanox and related companies are now positioned to release InfiniBand as a multifaceted architecture within several market segments. The most notable application area is enterprise-class network clusters and Internet data centers; these types of applications require extreme performance with the maximum in fault tolerance and reliability. Other computing uses include Internet service providers, colocation hosting, and large corporate networks. At least for its introduction, InfiniBand is positioned as a complementary architecture: IBA will move through a transitional period in which future PCI, IBA, and other interconnect standards can be offered within the same system or network. An understanding of PCI's limitations (even those of PCI-X) should allow InfiniBand to be an aggressive market contender as higher-class systems make the conversion to IBA devices. Currently, Mellanox is developing the IBA software interface standard using Linux as its internal OS choice.

Another key concern is the cost of implementing InfiniBand at the consumer level. Industry sources currently project IBA prices to fall somewhere between the currently available Gigabit Ethernet and Fibre Channel technologies. InfiniBand could be positioned as the dominant I/O connectivity architecture at all upper tier levels, providing the top level of Quality of Service (QoS), which can be implemented by the various methods discussed. This is definitely a technology to watch, and one that can create a competitive market.

