Menu

문서정보

Multicast over TCP/IP HOWTO

Multicast over TCP/IP HOWTO

Juan-Mariano de Goyeneche <jmseyas@dit.upm.es>

v1.0, 20 March 1998


This HOWTO tries to cover most aspects related to multicast over TCP/IP networks. So, a lot of the information within it is not Linux-specific (just in case you don't use GNU/Linux... yet). Multicast is currently an active area of research and, at the time of writing, many of the "standards" are merely drafts. Keep this in mind while reading the lines that follow.

1. Introduction.

2. Multicast Explained.

3. Kernel requirements and configuration.

4. The MBone.

5. Multicast applications.

6. Multicast programming.

7. The internals.

8. Routing Policies and Forwarding Techniques.

9. Multicast Transport Protocols.

10. References.

11. Copyright and Disclaimer.

12. Acknowledgements.


1. Introduction.

I'll try to give here the widest-ranging, most up-to-date and accurate information related to multicasting over TCP/IP networks that I can. Any feedback is very welcome. If you find any mistakes in this document, have any comments about its contents, or have an update or addition, please send them to me at the address listed at the top of this HOWTO.

1.1 What is Multicast.

Multicast is... a need. Well, at least in some scenarios. If you have information (a lot of information, usually) that should be transmitted to various (but usually not all) hosts over an internet, then Multicast is the answer. One common situation in which it is used is when distributing real time audio and video to the set of hosts which have joined a distributed conference.

Multicast is much like radio or TV in the sense that only those who have tuned their receivers (by selecting a particular frequency they are interested in) receive the information. That is: you hear the channel you are interested in, but not the others.

1.2 The problem with Unicast.

Unicast is anything that is not broadcast nor multicast. All right, the definition is not very bright... When you send a packet and there is only one sender process -yours- and one recipient process (the one you are sending the packet to), then this is unicast. TCP is, by its own nature, unicast oriented. UDP supports a lot more paradigms, but if you are sending UDP packets and there is only one process supposed to receive them, this is unicast too.

For years unicast transmissions proved to be enough for the Internet. It was not until 1993 that the first implementation of multicast saw the light in the 4.4 BSD release. It seems nobody needed it until then. What were the new problems that multicast addressed?

Needless to say, the Internet has changed a lot since the "early days". Particularly, the appearance of the Web strongly transformed the situation: people didn't just want connections to remote hosts, mail and FTP. First they wanted to see the pictures people placed in their home pages, but later they also wanted to see and hear those people.

With today's technology it is possible to afford the "cost" of making a unicast connection with everyone who wants to see your web page. However, if you are to send audio and video, which need a huge amount of bandwidth compared with web applications, you have -you had, until multicast came onto the scene- two options: establish a separate unicast connection with each of the recipients, or use broadcast. The first solution is not affordable: if we said that a single connection sending audio/video consumes a huge bandwidth, imagine having to establish hundreds or maybe thousands of those connections. Both the sending computer and your network would collapse.

Broadcast seems to be a solution, but it's certainly not the solution. If you want all the hosts in your LAN to attend the conference, you may use broadcast. Packets will be sent only once and every host will receive them, as they are sent to the broadcast address. The problem is that perhaps only a few of the hosts, and not all, are interested in those packets. Furthermore: perhaps some hosts are really interested in your conference, but they are outside of your LAN, a few routers away. And you know that broadcast works fine inside a LAN, but problems arise when you want broadcast packets to be routed across different LANs.

The best solution seems to be one in which you send packets to a certain special address (a certain frequency in radio/TV transmissions). Then, all hosts which have decided to join the conference will be aware of packets with that destination address, read them when they traverse the network, and pass them to the IP layer to be demultiplexed. This is similar to broadcasting in that you send only one broadcast packet and all the hosts in the network recognize and read it; it differs, however, in that not all multicast packets are read and processed, but only those that were previously registered in the kernel as being "of interest".

Those special packets are routed at kernel level like any packet because they are IP packets. The only difference might reside in the routing algorithm which tells the kernel where to route or not to route them.


2. Multicast Explained.

2.1 Multicast addresses.

As you probably know, the range of IP addresses is divided into "classes" based on the high order bits of a 32-bit IP address:


   Bit -->  0                           31            Address Range:
           +-+----------------------------+
           |0|       Class A Address      |       0.0.0.0 - 127.255.255.255
           +-+----------------------------+
           +-+-+--------------------------+
           |1 0|     Class B Address      |     128.0.0.0 - 191.255.255.255
           +-+-+--------------------------+
           +-+-+-+------------------------+
           |1 1 0|   Class C Address      |     192.0.0.0 - 223.255.255.255
           +-+-+-+------------------------+
           +-+-+-+-+----------------------+
           |1 1 1 0|  MULTICAST Address   |     224.0.0.0 - 239.255.255.255
           +-+-+-+-+----------------------+
           +-+-+-+-+-+--------------------+
           |1 1 1 1 0|     Reserved       |     240.0.0.0 - 247.255.255.255
           +-+-+-+-+-+--------------------+

The one which concerns us is the "Class D Address". Every IP datagram whose destination address starts with "1110" is an IP Multicast datagram.

The remaining 28 bits identify the multicast "group" the datagram is sent to. Following the previous analogy, you have to tune your radio to hear a program that is transmitted at some specific frequency; in the same way, you have to "tune" your kernel to receive packets sent to a specific multicast group. When you do that, it's said that the host has joined that group on the interface you specified. More on this later.

There are some special multicast groups, say "well known multicast groups", that you should not use in your own applications because of the special purpose they are intended for. For example, 224.0.0.1 is the all-hosts group, which every multicast-capable host joins on every multicast interface, and 224.0.0.2 is the all-routers group, joined by all multicast routers.

All these special multicast groups are regularly published in the "Assigned Numbers" RFC.

In any case, the range 224.0.0.0 through 224.0.0.255 is reserved for local purposes (such as administrative and maintenance tasks), and datagrams destined to these groups are never forwarded by multicast routers. Similarly, the range 239.0.0.0 to 239.255.255.255 has been reserved for "administrative scoping" (see section 2.3.1 for information on administrative scoping).
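Just to illustrate the address layout above, here is a small sketch of mine (not part of any standard) that uses the IN_MULTICAST() macro from <netinet/in.h> to test the four high-order bits and then checks the two reserved ranges just mentioned. Addresses are converted to host byte order before testing:

#include <stdio.h>
#include <arpa/inet.h>   /* inet_addr(), ntohl() */
#include <netinet/in.h>  /* IN_MULTICAST() */

/* classify() is just an illustrative helper, not a standard function. */
static void classify(const char *dotted)
{
        unsigned long a = ntohl(inet_addr(dotted));   /* host byte order */

        if (!IN_MULTICAST(a))                          /* high bits != 1110 */
                printf("%s: not a class D (multicast) address\n", dotted);
        else if ((a & 0xffffff00UL) == 0xe0000000UL)   /* 224.0.0.0/24 */
                printf("%s: local multicast, never forwarded\n", dotted);
        else if ((a & 0xff000000UL) == 0xef000000UL)   /* 239.0.0.0/8 */
                printf("%s: administratively scoped multicast\n", dotted);
        else
                printf("%s: ordinary multicast group\n", dotted);
}

int main(void)
{
        classify("224.0.0.1");
        classify("239.1.2.3");
        classify("192.0.2.1");   /* arbitrary unicast example */
        return 0;
}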

2.2 Levels of conformance.

Hosts can be in three different levels of conformance with the Multicast specification, according to the requirements they meet.

Level 0 is the "no support for IP Multicasting" level. Lots of hosts and routers in the Internet are in this state, as multicast support is not mandatory in IPv4 (it is, however, in IPv6). Not too much explanation is needed here: hosts in this level can neither send nor receive multicast packets. They must ignore the ones sent by other multicast capable hosts.

Level 1 is the "support for sending but not receiving multicast IP datagrams" level. Thus, note that it is not necessary to join a multicast group to be able to send datagrams to it. Very few additions are needed in the IP module to make a "Level 0" host "Level 1-compliant", as shown in section 2.3.

Level 2 is the "full support for IP multicasting" level. Level 2 hosts must be able to both send and receive multicast traffic. They must know the way to join and leave multicast groups and to propagate this information to multicast routers. Thus, they must include an Internet Group Management Protocol (IGMP) implementation in their TCP/IP stack.

2.3 Sending Multicast Datagrams.

By now, it should be obvious that multicast traffic is handled at the transport layer with UDP, as TCP provides point-to-point connections, which are not feasible for multicast traffic. (Heavy research is taking place to define and implement new multicast-oriented transport protocols. See section Multicast Transport Protocols for details).

In principle, an application just needs to open a UDP socket and fill in the destination address with the class D multicast address it wants to send data to. However, there are some operations that a sending process must be able to control.

TTL.

The TTL (Time To Live) field in the IP header has a double significance in multicast. As always, it controls the lifetime of the datagram to avoid it being looped forever due to routing errors. Routers decrement the TTL of every datagram as it traverses from one network to another, and when its value reaches 0 the packet is dropped.

The TTL in IPv4 multicasting also has the meaning of "threshold". Its use becomes evident with an example: suppose you set up a long, bandwidth-consuming video conference among all the hosts belonging to your department. You want that huge amount of traffic to remain in your LAN. Perhaps your department is big enough to have various LANs. In that case you want the hosts belonging to each of your LANs to attend the conference, but in no case do you want to collapse the entire Internet with your multicast traffic. There is a need to limit how "far" multicast traffic will expand across routers. That's what the TTL is used for. Routers have a TTL threshold assigned to each of their interfaces, and only datagrams with a TTL greater than the interface's threshold are forwarded. Note that when a datagram traverses a router with a certain threshold assigned, the datagram's TTL is not decremented by the value of the threshold. Only a comparison is made. (As before, the TTL is decremented by 1 each time a datagram passes across a router).

A list of TTL thresholds and their associated scope follows:


TTL     Scope
----------------------------------------------------------------------
   0    Restricted to the same host. Won't be output by any interface.
   1    Restricted to the same subnet. Won't be forwarded by a router.
 <32    Restricted to the same site, organization or department.
 <64    Restricted to the same region.
<128    Restricted to the same continent.
<255    Unrestricted in scope. Global.

Nobody knows what "site" or "region" mean exactly. It is up to the administrators to decide what these limits apply to.

The TTL trick is not always flexible enough for all needs, especially when dealing with overlapping regions or trying to establish geographic, topological and bandwidth limits simultaneously. To solve these problems, administratively scoped IPv4 multicast regions were established in 1994 (see D. Meyer's "Administratively Scoped IP Multicast" Internet draft). Administrative scoping is based on multicast addresses rather than on TTLs. The range 239.0.0.0 to 239.255.255.255 is reserved for this purpose.

Loopback.

When the sending host is Level 2 conformant and is also a member of the group datagrams are being sent to, a copy is looped back by default. This does not mean that the interface card reads its own transmission, recognizes it as belonging to a group the interface has joined, and reads it from the network. On the contrary, it is the IP layer which, by default, recognizes the to-be-sent datagram and copies and queues it on the IP input queue before sending it.

This feature is desirable in some cases, but not in others. So the sending process can turn it on and off at will.

Interface selection.

Hosts attached to more than one network should provide a way for applications to decide which network interface will be used to output the transmissions. If not specified, the kernel chooses a default one based on the system administrator's configuration.

2.4 Receiving Multicast Datagrams.

Joining a Multicast Group.

Broadcast is (in comparison) easier to implement than multicast. It doesn't require processes to give the kernel some rules regarding what to do with broadcast packets. The kernel just knows what to do: read and deliver all of them to the proper applications.

With multicast, however, it is necessary to advise the kernel which multicast groups we are interested in. That is, we have to ask the kernel to "join" those multicast groups. Depending on the underlying hardware, multicast datagrams are filtered by the hardware or by the IP layer (and, in some cases, by both). Only those with a destination group previously registered via a join are accepted.

Essentially, when we join a group we are telling the kernel: "OK. I know that, by default, you ignore multicast datagrams, but remember that I am interested in this multicast group. So, do read and deliver (to any process interested in them, not only to me) any datagram that you see in this network interface with this multicast group in its destination field".

Some considerations: first, note that you don't just join a group. You join a group on a particular network interface. Of course, it is possible to join the same group on more than one interface. If you don't specify a concrete interface, then the kernel will choose it based on its routing tables when datagrams are to be sent. It is also possible that more than one process joins the same multicast group on the same interface. They will all receive the datagrams sent to that group via that interface.

As said before, all multicast-capable hosts join the all-hosts group at start-up, so "pinging" 224.0.0.1 returns all hosts in the network that have multicast enabled.

Finally, consider that for a process to receive multicast datagrams it has to ask the kernel to join the group and bind the port those datagrams are being sent to. The UDP layer uses both the destination address and port to demultiplex the packets and decide which socket(s) to deliver them to.

Leaving a Multicast Group.

When a process is no longer interested in a multicast group, it informs the kernel that it wants to leave that group. It is important to understand that this doesn't mean that the kernel will no longer accept multicast datagrams destined to that multicast group. It will still do so if there are more processes that issued a "multicast join" petition for that group and are still interested. In that case the host remains a member of the group, until all the processes decide to leave it.

Even more: if you leave the group, but remain bound to the port you were receiving the multicast traffic on, and there are more processes that joined the group, you will still receive the multicast transmissions.

The idea is that joining a multicast group only tells the IP and data link layer (which in some cases explicitly tells the hardware) to accept multicast datagrams destined to that group. It is not a per-process membership, but a per-host membership.

Mapping of IP Multicast Addresses to Ethernet/FDDI addresses.

Both Ethernet and FDDI frames have a 48 bit destination address field. In order to avoid a kind of multicast ARP to map multicast IP addresses to ethernet/FDDI ones, the IANA reserved a range of addresses for multicast: every ethernet/FDDI frame with its destination in the range 01-00-5e-00-00-00 to 01-00-5e-ff-ff-ff (hex) contains data for a multicast group. The prefix 01-00-5e identifies the frame as multicast, the next bit is always 0 and so only 23 bits are left to the multicast address. As IP multicast groups are 28 bits long, the mapping can not be one-to-one. Only the 23 least significant bits of the IP multicast group are placed in the frame. The remaining 5 high-order bits are ignored, resulting in 32 different multicast groups being mapped to the same ethernet/FDDI address. This means that the ethernet layer acts as an imperfect filter, and the IP layer will have to decide whether to accept the datagrams the data-link layer passed to it. The IP layer acts as a definitive perfect filter.
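As a quick illustration of that mapping (a sketch of mine, not code taken from any stack), the following function builds the Ethernet address for a given group by combining the 01-00-5e prefix with the 23 low-order bits of the group, exactly as described above:

#include <stdio.h>
#include <arpa/inet.h>   /* inet_addr(), ntohl() */

/* Illustrative sketch: derive the Ethernet/FDDI multicast address for an
 * IPv4 group given in network byte order (as returned by inet_addr()). */
static void ip_to_ether_mcast(unsigned long group, unsigned char mac[6])
{
        unsigned long g = ntohl(group);   /* host byte order */

        mac[0] = 0x01;                    /* fixed 01-00-5e prefix */
        mac[1] = 0x00;
        mac[2] = 0x5e;
        mac[3] = (g >> 16) & 0x7f;        /* the next bit is always 0...  */
        mac[4] = (g >> 8)  & 0xff;        /* ...so only the 23 low-order  */
        mac[5] =  g        & 0xff;        /* bits of the group survive    */
}

int main(void)
{
        unsigned char mac[6];

        ip_to_ether_mcast(inet_addr("224.2.127.254"), mac);
        printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
               mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
        return 0;
}

Run with 224.2.127.254 it prints 01:00:5e:02:7f:fe; note that groups such as 225.2.127.254 or 238.130.127.254 map to that same frame address, which is why the IP layer still has to do the final filtering.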

Full details on IP Multicasting over FDDI are given in RFC 1390: "Transmission of IP and ARP over FDDI Networks". For more information on mapping IP Multicast addresses to ethernet ones, you may consult draft-ietf-mboned-intro-multicast-03.txt: "Introduction to IP Multicast Routing".

If you are interested in IP Multicasting over Token-Ring Local Area Networks, see RFC 1469 for details.


3. Kernel requirements and configuration.

Linux is, of course (you doubted it?), fully Level 2 Multicast-compliant. It meets all requirements to send, receive and act as a router (mrouter) for multicast datagrams.

If you just want to send and receive, you must say yes to "IP: multicasting" when configuring your kernel. If you also want your Linux box to act as a multicast router (mrouter) you also need to enable multicast routing in the kernel by selecting "IP: forwarding/gatewaying", "IP: multicast routing" and "IP: tunneling", the latter because new versions of mrouted rely on IP tunneling to send multicast datagrams encapsulated into unicast ones. This is necessary when establishing tunnels between multicast hosts separated by unicast-only networks and routers. (mrouted is a daemon that implements the multicast routing algorithm -the routing policy- and instructs the kernel on how to route multicast datagrams).

Some kernel versions label multicast routing as "EXPERIMENTAL", so you should enable "Prompt for development and/or incomplete code/drivers" in the "Code maturity level options" section.

If, when running mrouted, traffic generated in the same network your Linux box is connected to is correctly forwarded to the other network, but you can't see the other network's traffic on your local network, check whether you are receiving ICMP protocol error messages. Almost surely you forgot to turn on IP tunneling in your Linux router. It's a kind of stupid error when you know it but, believe me, it's quite time-consuming when you don't, and there is no apparent reason that explains what is going wrong. A sniffer proves to be quite useful in these situations!

(You can see more on multicast routing on section Routing Policies and Forwarding Techniques; mrouted and tunnels are also explained in sections The MBone and Multicast applications).

Once you have compiled and installed your new kernel, you should provide a default route for multicast traffic. The goal is to add a route to the network 224.0.0.0.

The problem most people seem to face at this stage of the configuration is the value of the netmask to supply. If you have read Terry Dawson's excellent NET-3-HOWTO, it should not be difficult to guess the correct value, though. As explained there, the netmask is a 32-bit number filled with all-1s in the network part of your IP address, and with all-0s in the host part. Recall from section 2.1 that a class D multicast address has no network/host sections. Instead it has a 28-bit group identifier and a 4-bit class D identifier. Well, these 4 bits are the network part and the remaining 28 the host part. So the netmask needed is 11110000000000000000000000000000 or, easier to read: 240.0.0.0. Then, the full command should be:

route add 224.0.0.0 netmask 240.0.0.0 dev eth0

Depending on how old your route program is, you might need to add the -net flag after the add.

Here we supposed that eth0 was multicast-capable and that, when not otherwise specified, we wanted multicast traffic to be output there. If this is not your case, change the dev parameter as appropriate.

The /proc filesystem proves here to be useful once again: you can check /proc/net/igmp to see the groups your host is currently subscribed to.


4. The MBone.

Using a new technology usually carries some advantages and disadvantages. The advantages of multicast are -I think- clear. The main disadvantage is that hundreds of hosts and, especially, routers don't support it yet. As a consequence, people who started working on multicast bought new equipment, modified their operating systems, and built multicast islands at their local sites. Then they discovered that it was difficult to communicate with people doing similar things because, if only one of the routers between them didn't support multicast, there was nothing to do...

The solution was clear: they decided to build a virtual multicast network on top of the Internet. That is: sites with multicast routers between them could communicate directly. But sites joined across unicast routers would send their island's multicast traffic encapsulated in unicast packets to other multicast islands. Routers in the middle would have no problems, as they would be dealing with unicast traffic. Finally, at the receiving site, traffic would be de-encapsulated and sent to the island in the original multicast way. The two ends converting from multicast to unicast, and then again to multicast, define what is called a multicast tunnel.

The MBone or Multicast Backbone is that virtual multicast network based on multicast islands connected by multicast tunnels.

Several activities take place on the MBone daily, but the profusion of teleconferences with real-time audio and video taking place across the whole Internet deserves special mention. As an example, the talk Linus Torvalds gave to the Silicon Valley Linux Users Group was recently transmitted live.

For more information on the MBone, see:

http://www.mediadesign.co.at/newmedia/more/mbone-faq.html


5. Multicast applications.

Most people dealing with multicast sooner or later decide to connect to the MBone, and then they usually need an mrouted. You'll also need it if you don't have a multicast-capable router and you want multicast traffic generated in one of your subnets to be "heard" on another. mrouted does circumvent the problem of sending multicast traffic across unicast routers -it encapsulates multicast datagrams into unicast ones (IP into IP)- but this is not the only feature it provides. Most important, it instructs the kernel on how to route (or not to route) multicast datagrams based on their source and destination. So, even if you have a multicast-capable router, mrouted can be used to tell it what to do with the datagrams (note I said what, and not how; mrouted says "forward this to the network connected to that interface", but the actual forwarding is performed by the kernel). This distinction between the actual forwarding and the algorithm that decides what to forward and where is very useful, as it allows the forwarding code to be written only once and placed in the kernel. Forwarding algorithms and policies are then implemented in user space daemons, so it is very easy to change from one policy to another without the need to recompile the kernel.

You can get a version of mrouted ported to Linux from:

ftp://www.video.ja.net/mice/mrouted/Linux/. This site is mirrored all across the world. Be sure to read the ftp://www.video.ja.net/mice/README.mirrors file to choose the one nearest you.

Next, we'll focus especially on multicast applications written to connect to the MBone which have been ported to Linux. The list is picked up from Michael Esler's "Linux Multicast Information" page http://www.cs.virginia.edu/~mke2e/multicast/. I recommend that page for lots of information and resources on multicast and Linux.

Audio Conferencing

Video Conferencing

Other Utilities

Session Tools

I placed session tools last because I think they deserve some explanation. When a conference takes place, several multicast groups and ports are assigned to each service you want for your conference (audio, video, shared whiteboards, etc...). Announcements of the conferences that will take place, along with information on the multicast groups, ports and programs that will be used (vic, vat, ...), are periodically multicast to the MBone. Session tools "hear" this information and present in an easy way which conferences are taking (or will take) place, so you can decide which interest you. They also facilitate the task of joining a session. Instead of launching each program that will be used and telling it which multicast group/port to join, you usually just need to click and the session tool launches the proper programs, supplying them with all the information needed to join the conference. Session tools usually let you announce your own conferences on the MBone.


6. Multicast programming.

Multicast programming... or writing your own multicast applications.

Several extensions to the programming API are needed in order to support multicast. All of them are handled via two system calls: setsockopt() (used to pass information to the kernel) and getsockopt() (to retrieve information regarding multicast behavior). This does not mean that 2 new system calls were added to support multicast. The pair setsockopt()/getsockopt() has been there for years -since 4.2 BSD at least. The addition consists of a new set of options (multicast options) that are passed to these system calls and that the kernel must understand.

The following are the setsockopt()/getsockopt() function prototypes:

int getsockopt(int s, int level, int optname, void* optval, int* optlen);

int setsockopt(int s, int level, int optname, const void* optval, int optlen);

The first parameter, s, is the socket the system call applies to. For multicasting, it must be a socket of the family AF_INET and its type may be either SOCK_DGRAM or SOCK_RAW. The most common use is with SOCK_DGRAM sockets, but if you plan to write a routing daemon or modify some existing one, you will probably need to use SOCK_RAW ones.

The second one, level, identifies the layer that is to handle the option, message or query, whatever you want to call it. So, SOL_SOCKET is for the socket layer, IPPROTO_IP for the IP layer, etc... For multicast programming, level will always be IPPROTO_IP.

optname identifies the option we are setting/getting. Its value (either supplied by the program or returned by the kernel) is optval. The optnames involved in multicast programming are the following:


                        setsockopt()            getsockopt()
IP_MULTICAST_LOOP           yes                     yes
IP_MULTICAST_TTL            yes                     yes
IP_MULTICAST_IF             yes                     yes
IP_ADD_MEMBERSHIP           yes                      no
IP_DROP_MEMBERSHIP          yes                      no

optlen carries the size of the data structure optval points to. Note that in getsockopt() it is a value-result parameter rather than a plain value: the kernel writes the value of the option into the buffer pointed to by optval and informs us of that value's size via optlen.

Both setsockopt() and getsockopt() return 0 on success and -1 on error.

6.1 IP_MULTICAST_LOOP.

You have to decide, as the application writer, whether you want the data you send to be looped back to your host or not. If you plan to have more than one process or user "listening", loopback must be enabled. On the other hand, if you are sending the images your video camera is producing, you probably don't want loopback, even if you want to see yourself on the screen. In that latter case, your application will probably receive the images from a device attached to the computer and send them to the socket. As the application already "has" that data, it is improbable it wants to receive it again on the socket. Loopback is by default enabled.

Note that optval is a pointer. You can't write:

setsockopt(socket, IPPROTO_IP, IP_MULTICAST_LOOP, 0, 1);

to disable loopback. Instead, write:

u_char loop;
setsockopt(socket, IPPROTO_IP, IP_MULTICAST_LOOP, &loop, sizeof(loop));

and set loop to 1 to enable loopback or to 0 to disable it.

To know whether a socket is currently looping-back or not use something like:

u_char loop;
int size = sizeof(loop);

getsockopt(socket, IPPROTO_IP, IP_MULTICAST_LOOP, &loop, &size);

6.2 IP_MULTICAST_TTL.

If not otherwise specified, multicast datagrams are sent with a default TTL value of 1, to prevent them from being forwarded beyond the local network. To change the TTL to the value you desire (from 0 to 255), put that value into a variable (here I name it "ttl") and write somewhere in your program:

u_char ttl;
setsockopt(socket, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

The behavior with getsockopt() is similar to the one seen on IP_MULTICAST_LOOP.

6.3 IP_MULTICAST_IF.

Usually, the system administrator specifies the default interface multicast datagrams should be sent from. The programmer can override this and choose a concrete outgoing interface for a given socket with this option.

struct in_addr interface_addr;
setsockopt (socket, IPPROTO_IP, IP_MULTICAST_IF, &interface_addr, sizeof(interface_addr));

From now on, all multicast traffic generated in this socket will be output from the interface chosen. To revert to the original behavior and let the kernel choose the outgoing interface based on the system administrator's configuration, it is enough to call setsockopt() with this same option and INADDR_ANY in the interface field.

In determining or selecting outgoing interfaces, the following ioctls might be useful: SIOCGIFADDR (to get an interface's address), SIOCGIFCONF (to get the list of all the interfaces) and SIOCGIFFLAGS (to get an interface's flags and, thus, determine whether the interface is multicast capable or not -the IFF_MULTICAST flag-).
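As an illustration (a rough sketch with minimal error handling and an arbitrarily sized buffer, not a canonical recipe), these ioctls could be combined like this to list the configured interfaces and report which of them are multicast capable:

#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        char buf[8192];                    /* arbitrary size for the example */
        struct ifconf ifc;
        struct ifreq *ifr;
        int i, n;

        ifc.ifc_len = sizeof(buf);
        ifc.ifc_buf = buf;
        if (ioctl(s, SIOCGIFCONF, &ifc) < 0) {     /* list of interfaces */
                perror("SIOCGIFCONF");
                return 1;
        }

        ifr = ifc.ifc_req;
        n = ifc.ifc_len / sizeof(struct ifreq);
        for (i = 0; i < n; i++) {
                struct ifreq flags = ifr[i];       /* copy carries the name */
                struct sockaddr_in *sin =
                        (struct sockaddr_in *) &ifr[i].ifr_addr;

                if (ioctl(s, SIOCGIFFLAGS, &flags) < 0)   /* interface flags */
                        continue;
                printf("%-8s %-15s %s\n", ifr[i].ifr_name,
                       inet_ntoa(sin->sin_addr),
                       (flags.ifr_flags & IFF_MULTICAST) ?
                       "multicast capable" : "no multicast");
        }
        close(s);
        return 0;
}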

If the host has more than one interface and the IP_MULTICAST_IF option is not set, multicast transmissions are sent from the default interface, although the remaining interfaces might be used for multicast forwarding if the host is acting as a multicast router.

6.4 IP_ADD_MEMBERSHIP.

Recall that you need to tell the kernel which multicast groups you are interested in. If no process is interested in a group, packets destined to it that arrive at the host are discarded. In order to inform the kernel of your interest and, thus, become a member of that group, you first fill an ip_mreq structure, which is later passed to the kernel in the optval field of the setsockopt() system call.

The ip_mreq structure (taken from /usr/include/linux/in.h) has the following members:

struct ip_mreq
{
        struct in_addr imr_multiaddr;   /* IP multicast address of group */
        struct in_addr imr_interface;   /* local IP address of interface */
};

(Note: the "physical" definition of the structure is in the file specified above. Nonetheless, you should not include <linux/in.h> if you want your code to be portable. Instead, include <netinet/in.h> which, in turn, includes <linux/in.h> itself).

The first member, imr_multiaddr, holds the group address you want to join. Remember that memberships are also associated with interfaces, not just groups. This is the reason you have to provide a value for the second member: imr_interface. This way, if you are in a multihomed host, you can join the same group in several interfaces. You can always fill this last member with the wildcard address (INADDR_ANY) and then the kernel will deal with the task of choosing the interface.

With this structure filled (say you defined it as: struct ip_mreq mreq;) you just have to call setsockopt() this way:

setsockopt (socket, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

Notice that you can join several groups to the same socket, not just one. The limit to this is IP_MAX_MEMBERSHIPS and, as of version 2.0.33, it has the value of 20.

6.5 IP_DROP_MEMBERSHIP.

The process is quite similar to joining a group:

struct ip_mreq mreq;
setsockopt (socket, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof(mreq));

where mreq is the same structure with the same data used when joining the group. If the imr_interface member is filled with INADDR_ANY, the first matching group is dropped.

If you have joined a lot of groups to the same socket, you don't need to drop memberships in all of them in order to terminate. When you close a socket, all memberships associated with it are dropped by the kernel. The same occurs if the process that opened the socket is killed.

Finally, keep in mind that a process dropping membership for a group does not imply that the host will stop receiving datagrams for that group. If another socket joined that group in that same interface previously to this IP_DROP_MEMBERSHIP, the host will keep being a member of that group.

Both ADD_MEMBERSHIP and DROP_MEMBERSHIP are nonblocking operations. They should return immediately indicating either success or failure.
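To wrap up this section, here is a minimal sketch of a receiver that puts the previous calls together: it binds a port, joins a group and prints whatever arrives. The group 226.1.1.1 and port 5000 are arbitrary example values of mine, not anything standardized, and error handling is kept to a minimum:

/* Minimal sketch of a multicast receiver: join a group, bind the port,
 * print whatever arrives.  Group and port are arbitrary example values. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define EXAMPLE_GROUP "226.1.1.1"
#define EXAMPLE_PORT  5000

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        struct ip_mreq mreq;
        char buf[1024];
        int n;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(EXAMPLE_PORT);
        if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
                perror("bind");
                exit(1);
        }

        /* IP_ADD_MEMBERSHIP: join the group on the default interface */
        mreq.imr_multiaddr.s_addr = inet_addr(EXAMPLE_GROUP);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                       &mreq, sizeof(mreq)) < 0) {
                perror("IP_ADD_MEMBERSHIP");
                exit(1);
        }

        while ((n = recvfrom(fd, buf, sizeof(buf) - 1, 0, NULL, NULL)) > 0) {
                buf[n] = '\0';
                printf("received: %s\n", buf);
        }
        return 0;
}

A sender needs nothing special: an ordinary sendto() to the same group and port is enough, plus IP_MULTICAST_TTL and IP_MULTICAST_LOOP if the defaults are not adequate.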


7. The internals.

This section's aim is to provide some information which is not needed to reach a basic understanding of how multicast works, nor to be able to write multicast programs, but which is very interesting, gives some insight into the underlying multicast protocols and implementations, and may be useful to avoid common errors and misunderstandings.

7.1 IGMP.

When talking about IP_ADD_MEMBERSHIP and IP_DROP_MEMBERSHIP, we said that the information provided by these "commands" was used by the kernel to decide which multicast datagrams to accept or discard. This is true, but it is not the whole truth. Such a simplification would imply that multicast datagrams for all multicast groups around the world would be received by our host, which would then check the memberships issued by processes running on it to decide whether to pass the traffic to them or to throw it away. As you can imagine, this would be a complete waste of bandwidth.

What actually happens is that hosts instruct their routers, telling them which multicast groups they are interested in; then those routers tell their upstream routers that they want to receive that traffic, and so on. The algorithms employed to decide when to ask for a group's traffic, or to announce that it is not desired anymore, vary a lot. There's something, however, that never changes: how this information is transmitted. IGMP is used for that. It stands for Internet Group Management Protocol. It is a new protocol, similar in many aspects to ICMP, with a protocol number of 2, whose messages are carried in IP datagrams, and which all Level 2-compliant hosts are required to implement.

As said before, it is used both by hosts to give membership information to their routers, and by routers to communicate among themselves. In the following I'll cover only the host-router relationship, mainly because I was unable to find information describing router-to-router communication other than the mrouted source code (RFC 1075, which describes the Distance Vector Multicast Routing Protocol, is now obsolete, and mrouted implements a modified DVMRP not yet documented).

IGMP version 0 is specified in RFC-988, which is now obsolete. Almost no one uses version 0 now.

IGMP version 1 is described in RFC-1112 and, although it has been updated by RFC-2236 (IGMP version 2), it is still in wide use. The Linux kernel implements the full IGMP version 1 requirements and parts of version 2, but not all.

Now I'll try to give an informal description of the protocol. You can check RFC-2236 for an in-depth formal description, with lots of state diagrams and time-out boundaries.

All IGMP messages have the following structure:


    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Type     | Max Resp Time |           Checksum            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Group Address                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

IGMP version 1 (hereinafter IGMPv1) labels the "Max Resp Time" field as "Unused", zeroes it when sending, and ignores it when receiving. Also, it breaks the "Type" field into two 4-bit wide fields: "Version" and "Type". As IGMPv1 identifies a "Membership Query" message as 0x11 (version 1, type 1) and IGMPv2 as 0x11 too, the 8 bits have the same effective interpretation.
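For reference, that layout corresponds to a C structure along the following lines (the field names are mine, chosen for illustration; the kernel's own definition lives in <linux/igmp.h>):

#include <stdint.h>

/* Illustrative C view of the 8-byte IGMP message shown above.
 * All multi-byte fields travel in network byte order. */
struct igmp_msg {
        uint8_t  type;          /* 0x11 query, 0x12 v1 report, ...     */
        uint8_t  max_resp_time; /* unused (0) in IGMPv1, 1/10 s in v2  */
        uint16_t checksum;      /* checksum of the whole IGMP message  */
        uint32_t group;         /* multicast group address (may be 0)  */
};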

I think it is more instructive to give first the IGMPv1 description and next point out the IGMPv2 additions, as they are mainly that, additions.

For the following discussions it is important to remember that multicast routers receive all IP multicast datagrams.

IGMP version 1.

Routers periodically (once every minute or two) send IGMP Host Membership Queries to the all-hosts group (224.0.0.1) with a TTL of 1. All multicast-capable hosts hear them, but don't answer immediately, to avoid an IGMP Host Membership Report storm. Instead, they start a random delay timer for each group they belong to on the interface on which they received the query.

Sooner or later, the timer expires in one of the hosts, and it sends an IGMP Host Membership Report (also with TTL 1) to the multicast address of the group being reported. As it is sent to the group, all hosts that joined the group -and which are currently waiting for their own timer to expire- receive it, too. They then stop their timers and don't generate any other report. Just one is generated -by the host that chose the smallest timeout- and that is enough for the router. It only needs to know that there are members for that group in the subnet, not how many nor which ones.

When no reports are received for a given group after a certain number of queries, the router assumes that no members are left, and thus it doesn't have to forward traffic for that group on that subnet. Note that in IGMPv1 there are no "Leave Group messages".

When a host joins a new group, the kernel sends a report for that group, so that the respective process does not need to wait a minute or two until a new membership query is received. As you can see, this IGMP packet is generated by the kernel as a response to the IP_ADD_MEMBERSHIP command, seen in section IP_ADD_MEMBERSHIP. Note the emphasis on the adjective "new": if a process issues an IP_ADD_MEMBERSHIP command for a group the host is already a member of, no IGMP packets are sent, as we must already be receiving traffic for that group; instead, a usage counter for that group is incremented. IP_DROP_MEMBERSHIP generates no datagrams in IGMPv1.

Host Membership Queries are identified by Type 0x11, and Host Membership Reports by Type 0x12.

No reports are sent for the all-hosts group. Membership in this group is permanent.

IGMP version 2.

One important addition to the above is the inclusion of a Leave Group message (Type 0x17). The reason is to reduce the bandwidth waste between the time the last host in the subnet drops membership and the time the router times-out for its queries and decides there are no more members present for that group (leave latency). Leave Group messages should be addressed to the all-routers group (224.0.0.2) rather than to the group being left, as that information is of no use for other members (kernel versions up to 2.0.33 send them to the group; although it does no harm to the hosts, it's a waste of time as they have to process them, but don't gain useful information). There are certain subtle details regarding when and when-not to send Leave Messages; if interested, see the RFC.

When an IGMPv2 router receives a Leave Message for a group, it sends Group-Specific Queries to the group being left. This is another addition. IGMPv1 has no group-specific queries. All queries are sent to the all-hosts group. The Type in the IGMP header does not change (0x11, as before), but the "Group Address" is filled with the address of the multicast group being left.

The "Max Resp Time" field, which was set to 0 on transmission and ignored on reception in IGMPv1, is meaningful only in "Membership Query" messages. It gives the maximum time allowed before sending a report, in units of 1/10 second. It is used as a tuning mechanism.

IGMPv2 adds another message type: 0x16. It is a "Version 2 Membership Report" sent by IGMPv2 hosts if they detect an IGMPv2 router is present (an IGMPv2 host knows an IGMPv1 router is present when it receives a query with the "Max Response" field set to 0).

When more than one router claims to act as querier, IGMPv2 provides a mechanism to avoid "discussions": the router with the lowest IP address is designated the querier. The other routers keep timeouts. If the router with the lowest IP address crashes or is shut down, the decision of who will be the querier is taken again after the timers expire.

7.2 Kernel corner.

This sub-section gives some starting points to study the multicast implementation of the Linux kernel. It does not explain that implementation. It just says where to find things.

The study was carried out on version 2.0.32, so it could be a bit outdated by the time you read it (network code seems to have changed A LOT in 2.1.x releases, for instance).

Multicast code in the Linux kernel is always surrounded by #ifdef CONFIG_IP_MULTICAST / #endif pairs, so that you can include or exclude it from your kernel based on your needs (this inclusion/exclusion is done at compile time, as you probably know if you are reading this section... #ifdefs are handled by the preprocessor. The decision is made based on what you selected when doing either a make config, make menuconfig or make xconfig).

You might want multicast features, but if your Linux box is not going to act as a multicast router you will probably not want multicast router features included in your new kernel. For this you have the multicast routing code surrounded by #ifdef CONFIG_IP_MROUTE / #endif pairs.

Kernel sources are usually placed in /usr/src/linux. However, the place may change so, both for accuracy and brevity, I will refer to the root directory of the kernel sources as just LINUX. Then, something like LINUX/net/ipv4/udp.c should be the same as /usr/src/linux/net/ipv4/udp.c if you unpacked the kernel sources in the /usr/src/linux directory.

All multicast interfaces with user programs shown in the section devoted to multicast programming are driven through the setsockopt()/getsockopt() system calls. Both of them are implemented by means of functions that make some tests to verify the parameters passed to them and which, in turn, call another function that makes some additional tests, demultiplexes the call based on the level parameter to either system call, and then calls another function which... (if you are interested in all these jumps, you can follow them in LINUX/net/socket.c (functions sys_socketcall() and sys_setsockopt()), LINUX/net/ipv4/af_inet.c (function inet_setsockopt()) and LINUX/net/ipv4/ip_sockglue.c (function ip_setsockopt())).

The one which interests us is LINUX/net/ipv4/ip_sockglue.c. Here we find ip_setsockopt() and ip_getsockopt(), which are mainly a switch (after some error checking) verifying each possible value for optname. Along with unicast options, all the multicast ones seen here are handled: IP_MULTICAST_TTL, IP_MULTICAST_LOOP, IP_MULTICAST_IF, IP_ADD_MEMBERSHIP and IP_DROP_MEMBERSHIP. Before the switch, a test is made to determine whether the options are multicast-router specific and, if so, they are routed to the ip_mroute_setsockopt() and ip_mroute_getsockopt() functions (file LINUX/net/ipv4/ipmr.c).

In LINUX/net/ipv4/af_inet.c we can see the default values we talked about in previous sections (loopback enabled, TTL=1) provided when the socket is created (taken from function inet_create() in this file):


 
#ifdef CONFIG_IP_MULTICAST
        sk->ip_mc_loop=1;
        sk->ip_mc_ttl=1;
        *sk->ip_mc_name=0;
        sk->ip_mc_list=NULL;
#endif

Also, the assertion of "closing a socket makes the kernel drop all memberships this socket had" is corroborated by:


#ifdef CONFIG_IP_MULTICAST
                /* Applications forget to leave groups before exiting */
                ip_mc_drop_socket(sk);
#endif

taken from inet_release(), on the same file as before.

Device independent operations for the Link Layer are kept in LINUX/net/core/dev_mcast.c.

Two important functions are still missing: the processing of input and output multicast datagrams. As with any other datagrams, incoming datagrams are passed from the device drivers to the ip_rcv() function (LINUX/net/ipv4/ip_input.c). This function is where the perfect filtering is applied to multicast packets that crossed the device layer (recall that lower layers only perform best-effort filtering, and it is IP that knows for certain whether we are interested in that multicast group or not). If the host is acting as a multicast router, this function also decides whether the datagram should be forwarded and calls ipmr_forward() appropriately. (ipmr_forward() is implemented in LINUX/net/ipv4/ipmr.c).

Code in charge of outputting packets is kept in LINUX/net/ipv4/ip_output.c. Here is where the IP_MULTICAST_LOOP option takes effect, as it is checked to see whether to loop back the packets or not (function ip_queue_xmit()). Also, the TTL of the outgoing packet is selected based on whether it is a multicast or a unicast one. In the former case, the argument passed to the IP_MULTICAST_TTL option is used (function ip_build_xmit()).

While working with mrouted (a program which gives the kernel information about how to route multicast datagrams), we detected that all multicast packets originated on the local network were properly routed..., except the ones from the Linux box that was acting as the multicast router!! ip_input.c was working OK, but it seemed ip_output.c wasn't. Reading the source code for the output functions, we found that outgoing datagrams were not being passed to ipmr_forward(), the function that had to decide whether they should be routed or not. The packets were output to the local network but, as network cards are usually unable to read their own transmissions, those datagrams were never routed. We added the necessary code to the ip_build_xmit() function and everything was OK again. (Having the sources for your kernel is not a luxury or pedantry; it's a need!)

ipmr_forward() has been mentioned a couple of times. It is an important function, as it clears up an important misunderstanding that appears to be widely spread. When routing multicast traffic, it is not mrouted who makes the copies and sends them to the proper recipients. mrouted receives all multicast traffic and, based on that information, computes the multicast routing tables and tells the kernel how to route: "datagrams for this group coming from that interface should be forwarded to those interfaces". This information is passed to the kernel by calls to setsockopt() on a raw socket opened by the mrouted daemon (the protocol specified when the raw socket is created must be IPPROTO_IGMP). These options are handled in the ip_mroute_setsockopt() function in LINUX/net/ipv4/ipmr.c. The first option (it would be better to call them commands rather than options) issued on that socket must be MRT_INIT. All other commands are ignored (returning -EACCES) if MRT_INIT has not been issued first. Only one instance of mrouted can be running at the same time on the same host. To keep track of this, when the first MRT_INIT is received, an important variable, struct sock* mroute_socket, is pointed to the socket MRT_INIT was received on. If mroute_socket is not null when attending an MRT_INIT, this means another mrouted is already running and -EADDRINUSE is returned. All the remaining commands (MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC, MRT_DEL_MFC and MRT_ASSERT) return -EACCES if they come from a socket different from mroute_socket.
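As a rough sketch of that interface (not the actual mrouted code, and with error handling reduced to a minimum), the control socket would be opened more or less like this:

#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/mroute.h>   /* MRT_INIT and friends */

/* Sketch: open the single multicast-routing control socket.
 * A real daemon (mrouted) goes on to add VIFs and MFC entries. */
int open_mroute_socket(void)
{
        int s = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
        int on = 1;

        /* MRT_INIT must be the first command on this socket; a second
         * daemon trying the same thing gets EADDRINUSE, as explained above. */
        if (s < 0 || setsockopt(s, IPPROTO_IP, MRT_INIT, &on, sizeof(on)) < 0)
                return -1;
        return s;
}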

As routed multicast datagrams can be received/sent across either physical interfaces or tunnels, a common abstraction for both was devised: VIFs, Virtual InterFaces. mrouted passes vif structures to the kernel, indicating physical or tunnel interfaces to add to its routing tables, and multicast forwarding entries saying where to forward datagrams.

VIFs are added with MRT_ADD_VIF and deleted with MRT_DEL_VIF. Both pass a struct vifctl to the kernel (defined in /usr/include/linux/mroute.h) with the following information:


struct vifctl {
        vifi_t  vifc_vifi;              /* Index of VIF */
        unsigned char vifc_flags;       /* VIFF_ flags */
        unsigned char vifc_threshold;   /* ttl limit */
        unsigned int vifc_rate_limit;   /* Rate limiter values (NI) */
        struct in_addr vifc_lcl_addr;   /* Our address */
        struct in_addr vifc_rmt_addr;   /* IPIP tunnel addr */
};

With this information a vif_device structure is built:


struct vif_device
{
        struct device   *dev;                   /* Device we are using */
        struct route    *rt_cache;              /* Tunnel route cache */
        unsigned long   bytes_in,bytes_out;
        unsigned long   pkt_in,pkt_out;         /* Statistics */
        unsigned long   rate_limit;             /* Traffic shaping (NI) */
        unsigned char   threshold;              /* TTL threshold */
        unsigned short  flags;                  /* Control flags */
        unsigned long   local,remote;           /* Addresses(remote for tunnels)*/
};

Note the dev entry in the structure. The device structure is defined in /usr/include/linux/netdevice.h file. It is a big structure, but the field that interests us is:


  struct ip_mc_list*    ip_mc_list;   /* IP multicast filter chain    */

The ip_mc_list structure -defined in /usr/include/linux/igmp.h- is as follows:


struct ip_mc_list
{
        struct device *interface;
        unsigned long multiaddr;
        struct ip_mc_list *next;
        struct timer_list timer;
        short tm_running;
        short reporter;
        int users;
};

So, the ip_mc_list member from the dev structure is a pointer to a linked list of ip_mc_list structures, each containing an entry for each multicast group the network interface is a member of. Here again we see membership is associated to interfaces. LINUX/net/ipv4/ip_input.c traverses this linked list to decide whether the received datagram is destined to any group the interface that received the datagram belongs to:


#ifdef CONFIG_IP_MULTICAST
                if(!(dev->flags&IFF_ALLMULTI) && brd==IS_MULTICAST 
                   && iph->daddr!=IGMP_ALL_HOSTS 
                   && !(dev->flags&IFF_LOOPBACK))
                {
                        /*
                         *      Check it is for one of our groups
                         */
                        struct ip_mc_list *ip_mc=dev->ip_mc_list;
                        do
                        {
                                if(ip_mc==NULL)
                                {
                                        kfree_skb(skb, FREE_WRITE);
                                        return 0;
                                }
                                if(ip_mc->multiaddr==iph->daddr)
                                        break;
                                ip_mc=ip_mc->next;
                        }
                        while(1);
                }
#endif

The users field in the ip_mc_list structure is used to implement what was said in section IGMP version 1: if a process joins a group and the interface is already a member of that group (ie, another process joined that same group in that same interface before) only the count of members (users) is incremented. No IGMP messages are sent, as you can see in the following code (taken from ip_mc_inc_group(), called by ip_mc_join_group(), both in LINUX/net/ipv4/igmp.c):


        for(i=dev->ip_mc_list;i!=NULL;i=i->next)
        {
                if(i->multiaddr==addr)
                {
                        i->users++;
                        return;
                }
        }

When dropping memberships, the counter is decremented and additional operations are performed only when the count reaches 0 (ip_mc_dec_group()).

MRT_ADD_MFC and MRT_DEL_MFC set or delete forwarding entries in the multicast routing tables. Both pass a struct mfcctl to the kernel (also defined in /usr/include/linux/mroute.h) with this information:


struct mfcctl
{
        struct in_addr mfcc_origin;             /* Origin of mcast      */
        struct in_addr mfcc_mcastgrp;           /* Group in question    */
        vifi_t  mfcc_parent;                    /* Where it arrived     */
        unsigned char mfcc_ttls[MAXVIFS];       /* Where it is going    */
};

With all this information in hand, ipmr_forward() "walks" across the VIFs and, if a match is found, it duplicates the datagram and calls ipmr_queue_xmit() which, in turn, uses the output device specified by the routing table and the proper destination address if the packet is to be sent across a tunnel (i.e., the unicast destination address of the other end of the tunnel).

Function ip_rt_event() (not directly related to output, but also in ip_output.c) receives events related to a network device, such as the device going up. This function ensures that the device then joins the ALL-HOSTS multicast group.

IGMP functions are implemented in LINUX/net/ipv4/igmp.c. Important information for those functions appears in /usr/include/linux/igmp.h and /usr/include/linux/mroute.h. The IGMP entry in the /proc/net directory is created with ip_init() in LINUX/net/ipv4/ip_output.c.


8. Routing Policies and Forwarding Techniques.

One trivial algorithm to make worldwide multicast traffic available everywhere could be to send it... everywhere, whether anyone wants it or not. As this does not seem quite optimal, several routing algorithms and forwarding techniques have been implemented.

DVMRP (Distance Vector Multicast Routing Protocol) is, perhaps, the one most multicast routers use now. It is a dense mode routing protocol, that is, it performs well in environments with high bandwidth and densely distributed members. However, in sparse mode scenarios, it suffers from scalability problems.

Together with DVMRP we can find other dense mode routing protocols, such as MOSPF (Multicast Extensions to OSPF -Open Shortest Path First-) and PIM-DM (Protocol-Independent Multicast Dense Mode).

To perform routing in sparse mode environments, we have PIM-SM (Protocol Independent Multicast Sparse Mode) and CBT (Core Based Trees).

OSPF version 2 is explained in RFC 1583, and MOSPF in RFC 1584. PIM-SM and CBT are described in RFC 2117 and RFC 2201, respectively.


Karger Consistent Hash

Consistent hashing is a hashing technique in which, given a set of keys K and N slots, most keys keep their slot even when the number of slots N changes. When a slot is added or removed, only about K/N keys have to be remapped; because only the share corresponding to the changed node is remapped, the scheme can be called consistent. With ordinary hashing, changing the number of slots forces almost every key to be remapped. The consistent hash algorithm was developed by Karger in 1997. It was devised to implement web caches, where it guarantees a high hit rate regardless of cache nodes being added or removed. It is a successful algorithm used in many areas: distributed caches, distributed storage, routing, distributed request handling and so on.

When consistent hashing is needed

Suppose you have N cache machines. The hash function can then be written as hash(object) mod N. This works fine, but as soon as a cache is added or removed and N changes, the location of almost every object changes as well and the hash becomes useless. The whole cache then has to be rebuilt from scratch, which can cause serious trouble if cache requests are already pouring in. Consistent hashing overcomes this problem.

Implementation ideas

Hash ring

Be careful with the interpretation of the word "consistent". It does not mean consistent in every case, but as consistent as possible: when nodes are added or removed, the changed share has to be rebalanced, so some relocation of content is unavoidable. Of course, the K/N bound must still hold. The key idea of consistent hashing is to have more than one hash value point to each cache node.

The hash ring in the example has four nodes. A few bit operations on a key decide which node it goes to.

Node removal (failure)

Consider what happens when node 1 drops out of the ring.

When node 1 disappears, keys such as 0001 (and 0101, 1101 and so on) move to node 2. That looks clean, but it has the drawback that traffic piles up on node 2: if each node was handling 60% of its capacity, node 2 now has to handle 120%.

This problem can be solved by placing virtual nodes on the ring. (Running the hash one more time is another option, but it is not worth going into here.)

Virtual nodes

Placing one or more virtual nodes per physical node over the whole hash ring solves the node-failure problem. Assume a hash ring of size 2^32 holding four nodes, and that every time a node is added, four virtual nodes for it are placed on the ring.

This spreads requests over the whole hash ring in proportion to the number of virtual nodes. The virtual nodes are placed on the ring pseudo-randomly, using a random function, a hash, or an algorithm such as CRC32. If the placement is random, the probability that (in theory) node N-2 comes right after node N-1 on the ring is 1/(number of nodes).

Now suppose one node (the yellow one in the original figure) fails. All of its virtual nodes are removed from the ring, and the keys that used to map to them go to the next node on the ring. Since the probability that any particular node comes right after a yellow virtual node is 1/4, the orphaned keys are spread roughly evenly over the remaining nodes.

The node-addition algorithm looks like this:

// NodeName       : name of the node being added
// virtualNodeNum : number of virtual nodes to create for it
func Hash.AddNode(NodeName, virtualNodeNum) {
   for i := 0 ; i < virtualNodeNum ; i++ {
       // hash the node name together with the index, so that every
       // virtual node lands at a different position on the ring
       NodeTable[ Hash.Get(NodeName + "-" + i) ] = NodeName
   }
}
Running the algorithm produces a node table roughly like this:

Node ID    Node Name
---------------------
  2311     Node-1
  4571     Node-2
  6190     Node-1
  8188     Node-4
  9549     Node-2
  9549     Node-1
 10012     Node-3
 11810     Node-2
 12245     Node-1
 13296     Node-3
If hashing client-1 yields 3121, then, since 2311 < 3121 <= 4571, the lookup returns Node-2.
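A minimal C sketch of that lookup, using a sorted array in place of a sorted map and a few of the example values from the table above, could look like this:

#include <stdio.h>

/* Sketch of a hash-ring lookup over a sorted table of virtual nodes. */
struct vnode { unsigned id; const char *name; };

static struct vnode ring[] = {
        {  2311, "Node-1" }, {  4571, "Node-2" }, {  6190, "Node-1" },
        {  8188, "Node-4" }, {  9549, "Node-2" }, { 10012, "Node-3" },
        { 11810, "Node-2" }, { 12245, "Node-1" }, { 13296, "Node-3" },
};
static const int nring = sizeof(ring) / sizeof(ring[0]);

/* Return the first virtual node whose id is >= the key's hash;
 * wrap around to the first entry if we run off the end of the ring.
 * A linear scan keeps the sketch short; a real implementation would
 * use binary search or a balanced tree. */
static const char *lookup(unsigned keyhash)
{
        int i;
        for (i = 0; i < nring; i++)
                if (keyhash <= ring[i].id)
                        return ring[i].name;
        return ring[0].name;
}

int main(void)
{
        printf("hash 3121 -> %s\n", lookup(3121));  /* Node-2, as in the text */
        return 0;
}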

Node changes and key redistribution

We have seen that adding a node means creating one or more virtual nodes and placing them on the hash ring. Once a node has been added, keys have to be redistributed so that the load is spread out.

This redistribution must satisfy K * 1/N, where K is the total number of keys and N the number of nodes. If a node is added to a hash ring with 4 nodes and 100 keys, 100 * 1/5 = 20 keys are redistributed from the existing nodes; each existing node hands about 5 keys to the new node, which ends up with 20.

For the redistributed keys, the associated data also has to be moved: a cache server has to migrate the cached files, and for network connections the existing connections have to be torn down.

Implementations

OpenStack Swift

OpenStack Swift is an object storage system written in Python. It uses consistent hashing to distribute objects effectively over a constantly changing set of storage nodes and to find them quickly. It is based on Karger's consistent hash algorithm, modified to work more efficiently.

Algorithm

A hash ring prevents requests from piling up on one node by placing virtual nodes, but the downside is that it requires a sorted map: every request has to search that table. Swift hashes the file name, so every file lookup would have to search the table; if each node has 100 virtual nodes, the cost of that search is no longer negligible.

OpenStack solves this by using the hash ring to build partitions. (Replicas are not considered here.)

The idea is simple: build the hash ring in the usual way, then build a new, indexed ring from that information. Where the original ring requires a search, the indexed ring finds the right node with an O(1) operation.

In exchange for fast object lookup, there are two drawbacks:
  1. The hash ring has to be rebuilt whenever a node is added.
  2. Building the hash ring takes a lot of memory.
In short, it trades memory for performance.

System layout

Jump consistent hash

Introduction

This is a hash algorithm published by Google under the motto "A Fast, Minimal Memory, Consistent Hash Algorithm". It looked remarkable enough to be worth a closer look.

The algorithm is just five lines:
int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
	int64_t b = -1, j = 0;
	while (j < num_buckets) {
		b = j;
		key = key * 2862933555777941757ULL + 1;
		j = (b + 1) * (double(1LL << 31) / double((key >> 33) + 1));
	}
	return b;
}

I wrote a small test program and ran it, feeding 100,000 keys into 8 buckets.
#include <iostream>
#include <stdint.h>
#include <map>

using namespace std;

int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
	int64_t b = -1, j = 0;
	while (j < num_buckets) {
		b = j;
		key = key * 2862933555777941757ULL + 1;
		j = (b + 1) * (double(1LL << 31) / double((key >> 33) + 1));
	}
	return b;
}
int main(int argc, char **argv) {
	map<int, int> map1;
	map<int, int>::iterator mi;
	int32_t v;
	for (uint64_t i = 0; i < 100000; i++) {
		v = JumpConsistentHash(i, 8);
		mi = map1.find(v);
		if (mi == map1.end()) {
			map1[v] = 1;
		} else {
			mi->second++;
		}
	}
	for (mi=map1.begin(); mi !=map1.end(); ++mi) {
		cout << mi->first << " : " << mi->second << endl;
	}
}

The output:
0 : 12496
1 : 12498
2 : 12503
3 : 12501
4 : 12470
5 : 12478
6 : 12496
7 : 12558
It works remarkably well. I also verified that adjusting the bucket count satisfies the K/N property. So let's think about how it works.

Jump consistent hash can be pictured as a board game. The bucket count is the distance to the goal, and the key is the playing piece. You throw a specially made die to advance the piece, and work out how far the game has gone when the piece finally jumps past the goal. Of course, the probability of jumping far keeps shrinking; if that probability is tuned to 1/N (N being the distance), the board is fair. If the distance to the goal (the bucket count) is 4, the probabilities are 1, 1/2, 1/3, 1/4. You throw this die on every turn.

The heart of this board game is the design of the die, which would look roughly like this: each throw produces a string of bits, and the number of zeros before the first 1 is the number of squares the piece may advance (doing it the other way round works just as well) - a bit like mahjong tiles, perhaps. One extra rule is added: if the throw says "advance zero squares" (stay put), the piece moves one square anyway. Then, even in the unluckiest case, the piece reaches the goal after at most N (the distance) throws. Throwing this die over and over produces the jump sequence.

The five-line function above implements exactly this. Look at the loop:

int64_t b = -1, j = 0;
while (j < num_buckets) {
	b = j;
	key = key * 2862933555777941757ULL + 1;
	j = (b + 1) * (double(1LL << 31) / double((key >> 33) + 1));
}
return b;

The constant 2862933555777941757 is a 64-bit linear congruential multiplier, so the multiplication wraps around 64 bits for most keys and key behaves like a pseudo-random 64-bit number. Shifting it down (key >> 33) and converting the result to double yields a random value; dividing by that value (plus one) gives a die whose probability of reaching further keeps falling off as 1/N. When the bucket count grows by one, the probability that an existing key is reassigned to the new bucket is 1/N, which is exactly the consistent-hashing condition.

Implementation

Jump consistent hash could be deployed as follows: a proxy server runs the key through the hash function, finds the bucket, and relays the request. Since the amount of work grows with the number of buckets (with N buckets the loop runs at most N times, and the expected number of iterations grows with N), it is a good idea to cache the results.

If the clients carry the same hash function as the proxy, a client only needs to know the bucket count to work out which bucket it should connect to; the proxy then merely checks that the bucket the client asked for is correct and relays the request. The advantage of this approach is that the hashing work is spread over the clients. The drawback is that whenever buckets are added or removed, the K/N share of affected clients has to be told their new bucket.

Applications

Web cache

A web cache is the first thing to consider when trying to improve the quality of a web service. Static content such as CSS, JavaScript, images and HTML is cached on high-performance servers that answer in place of the web server; only dynamic pages that are hard to cache, or requests that missed the cache, are passed on to the web server.

With consistent hashing you can build a distributed web cache system:

  1. A user request first goes to the proxy server.
  2. The proxy checks whether the requested resource (URL) is in the hash table.
  3. If it is, the request is sent to the corresponding cache server.
  4. If it is not, or fetching it from the cache fails, the request goes to the web server.

With a consistent hash in place, caches can be added or removed freely, without manual rebalancing. This is especially useful for serving large numbers of multimedia files.

Sharding

A database shard is a horizontal partitioning technique used in search engines and databases. The contents of the database are stored across several server instances, spreading the load.

Key-value NoSQL databases often implement sharding using consistent hashing.
