I'm looking for an efficient way to mirror storage devices over the internet. Here are the features I'd like to see:
1.) Efficient use of bandwidth. This means that doing a simple once-daily rsync would not be ideal, because if I had ten nodes in the "distributed cluster," each node would have to download all the changes from the "master node." I'd rather see the node where the changes were made send 1/9th of the changes to each of the other nine nodes, and have those nine nodes share the data among themselves in a similar way to make the most efficient use of available bandwidth. The paths between nodes could be analyzed before the transfer and the data split into chunks according to who can handle the most data.
2.) No "master/slave" or "primary/secondary" relationship. I'd like for the operator of each node to be able to make changes to each storage device, and have the changes immediately pushed out to the other nodes in the cluster.
3.) No real need for process sharing, so this isn't exactly a High Availability thing in the usual sense where if one machine goes down, the running processes kick in on another machine.
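To make the chunk-splitting idea in point 1 concrete, here is a rough Python sketch. It is only an illustration, not something an existing tool does: the function name `split_by_capacity`, the node names, and the bandwidth figures are all hypothetical, and it assumes the per-path bandwidth has already been measured somehow and is given as whole numbers.

```python
def split_by_capacity(total_size, bandwidth):
    """Split total_size bytes into one contiguous range per peer.

    bandwidth maps a peer name to its measured capacity as an integer
    (hypothetical units, e.g. kbit/s). Returns {peer: (offset, length)}
    where each length is proportional to that peer's capacity.
    """
    if total_size < 0:
        raise ValueError("total_size must not be negative")
    total_bw = sum(bandwidth.values())
    if total_bw <= 0:
        raise ValueError("at least one peer needs positive bandwidth")

    # Floor of each peer's proportional share, in whole bytes.
    shares = {peer: total_size * bw // total_bw for peer, bw in bandwidth.items()}

    # Flooring can leave a few bytes unassigned; give one extra byte
    # each to the fastest peers until everything is covered.
    leftover = total_size - sum(shares.values())
    for peer in sorted(bandwidth, key=bandwidth.get, reverse=True)[:leftover]:
        shares[peer] += 1

    # Turn the lengths into back-to-back byte ranges.
    plan, offset = {}, 0
    for peer in bandwidth:
        plan[peer] = (offset, shares[peer])
        offset += shares[peer]
    return plan


if __name__ == "__main__":
    # Hypothetical measurements for three of the peers.
    print(split_by_capacity(1000, {"node-b": 100, "node-c": 300, "node-d": 600}))
```

Each peer would then receive only its own range from the originating node and trade the remaining ranges with the other peers, which is the swarm-style sharing described in point 1.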
The best I can think of so far is DRBD. I can mirror devices over a network with DRBD, but only one device can be the "primary" device that can be directly read from / written to at any given time. (I learned this from the Gentoo How-To and haven't confirmed it from other sources.) Also, I don't expect it would make efficient use of a network. rsync is not ideal either: while it is efficient at handling data changes, it is not efficient at mirroring data to multiple nodes and is not automatic enough.
Do you guys know of any Linux gems I may be overlooking that can perform some of these functions? Alternatively, is there anyone who uses DRBD to do something similar?
Thanks!
EDIT: Just read this on the DRBD website: "Since DRBD-8.0.0 you can run both nodes in the primary role, enabling to mount a cluster file system (a physical parallel file system) one both nodes concurrently."
That's more like what I want! This is the bare minimum that would satisfy me, so I might be able to get by with DRBD. I'm not sure about the rest, or whether it can be achieved with off-the-shelf tools.