`tar` does preserve the hardlinks.
Here's a sample directory with three hard-linked files and one file with a single link:
foo% vdir .
total 16
-rw-r--r-- 3 kostix kostix 5 Jul 12 19:37 bar.txt
-rw-r--r-- 3 kostix kostix 5 Jul 12 19:37 foo.txt
-rw-r--r-- 3 kostix kostix 5 Jul 12 19:37 test.txt
-rw-r--r-- 1 kostix kostix 9 Jul 12 19:49 xyzzy.txt
Now we archive it using GNU `tar` and verify that it indeed added the links (because we didn't pass it the `--hard-dereference` command-line option):
foo% tar -cf ../foo.tar .
foo% tar -tvf ../foo.tar
drwxr-xr-x kostix/kostix 0 2016-07-12 19:49 ./
-rw-r--r-- kostix/kostix 9 2016-07-12 19:49 ./xyzzy.txt
-rw-r--r-- kostix/kostix 5 2016-07-12 19:37 ./bar.txt
hrw-r--r-- kostix/kostix 0 2016-07-12 19:37 ./test.txt link to ./bar.txt
hrw-r--r-- kostix/kostix 0 2016-07-12 19:37 ./foo.txt link to ./bar.txt
The documentation of `archive/tar` refers to a bunch of documents defining the tar archive format (and unfortunately, there's no single standard: for instance, GNU tar does not support POSIX extended attributes, while BSD tar (which relies on `libarchive`) does, and so does `pax`).
To cite its bit on the hardlinks:
LNKTYPE
This flag represents a file linked to another file, of any type,
previously archived. Such files are identified in Unix by each file having
the same device and inode number. The linked-to name is specified in the
linkname field with a trailing null.
So, a hardlink is an entry of a special type ('1') which refers to some
preceding (already archived) file by its name.
So let's create a playground example.
We base64-encode our archive:
foo% base64 <../foo.tar | xclip -selection clipboard
…and write the code.
The archive contains a single directory, one file (type '0'), then another file (type '0') followed by two hardlinks (type '1') to it.
The output from the playground example:
Archive entry '5': ./
Archive entry '0': ./xyzzy.txt
Archive entry '0': ./bar.txt
Archive entry '1': ./test.txt link to ./bar.txt
Archive entry '1': ./foo.txt link to ./bar.txt
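A reader producing a listing in this format could look roughly like the sketch below. To stay self-contained it does not decode the base64 blob; instead it builds a tiny two-entry archive in memory. The names `sampleArchive` and `listEntries` are made up for illustration.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
)

// sampleArchive builds a small in-memory archive: one regular file
// followed by a hard link to it. The entry names are illustrative.
func sampleArchive() *bytes.Buffer {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("test\n")
	entries := []*tar.Header{
		{Typeflag: tar.TypeReg, Name: "./bar.txt", Mode: 0644, Size: int64(len(body))},
		{Typeflag: tar.TypeLink, Name: "./foo.txt", Linkname: "./bar.txt", Mode: 0644},
	}
	for _, hdr := range entries {
		if err := tw.WriteHeader(hdr); err != nil {
			log.Fatal(err)
		}
		// Only regular files carry data; a hard link is header-only.
		if hdr.Typeflag == tar.TypeReg {
			if _, err := tw.Write(body); err != nil {
				log.Fatal(err)
			}
		}
	}
	if err := tw.Close(); err != nil {
		log.Fatal(err)
	}
	return &buf
}

// listEntries returns one line per archive entry, showing its type flag
// and, for hard links, the name of the entry it refers to.
func listEntries(r io.Reader) ([]string, error) {
	var lines []string
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return nil, err
		}
		line := fmt.Sprintf("Archive entry '%c': %s", hdr.Typeflag, hdr.Name)
		if hdr.Typeflag == tar.TypeLink {
			line += " link to " + hdr.Linkname
		}
		lines = append(lines, line)
	}
}

func main() {
	lines, err := listEntries(sampleArchive())
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

The key point is that `Typeflag` and `Linkname` come straight from the header; no file data has to be read to detect a hardlink.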
So your link-counting code should:
1. Scan the entire archive record-by-record.
2. Remember any regular file (type `archive/tar.TypeReg` or type `archive/tar.TypeRegA`) already processed, and have a counter associated with it, which starts at 1.
   Well, in reality, you'd better record entries of all types except symbolic links and directories, because tar archives can contain nodes for character and block devices, and FIFOs (named pipes).
3. When you encounter a hard link (type `archive/tar.TypeLink`),
   - Read the `Linkname` field of its header.
   - Look that name up in your list of "seen" files and increase the counter of the matching entry.
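The steps above can be sketched as follows. This is only an illustration, not a hardened implementation: `countLinks` and `sampleArchive` are made-up names, the sample archive merely mimics the layout shown earlier, and treating a link to a not-yet-seen name as an error is my assumption about what you'd want.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
)

// countLinks scans a tar stream and returns, for every entry that is
// neither a directory nor a symlink, how many names in the archive
// refer to it (the entry itself plus the hard links pointing at it).
func countLinks(r io.Reader) (map[string]int, error) {
	counts := make(map[string]int)
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return counts, nil
		}
		if err != nil {
			return nil, err
		}
		switch hdr.Typeflag {
		case tar.TypeDir, tar.TypeSymlink:
			// Not counted.
		case tar.TypeLink:
			// A hard link must name a previously archived entry.
			if _, ok := counts[hdr.Linkname]; !ok {
				return nil, fmt.Errorf("%s links to unseen entry %s",
					hdr.Name, hdr.Linkname)
			}
			counts[hdr.Linkname]++
		default:
			counts[hdr.Name] = 1
		}
	}
}

// sampleArchive builds an in-memory archive with one directory, two
// regular files and two hard links to the second file.
func sampleArchive() *bytes.Buffer {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("test\n")
	size := int64(len(body))
	entries := []*tar.Header{
		{Typeflag: tar.TypeDir, Name: "./", Mode: 0755},
		{Typeflag: tar.TypeReg, Name: "./xyzzy.txt", Mode: 0644, Size: size},
		{Typeflag: tar.TypeReg, Name: "./bar.txt", Mode: 0644, Size: size},
		{Typeflag: tar.TypeLink, Name: "./test.txt", Linkname: "./bar.txt", Mode: 0644},
		{Typeflag: tar.TypeLink, Name: "./foo.txt", Linkname: "./bar.txt", Mode: 0644},
	}
	for _, hdr := range entries {
		if err := tw.WriteHeader(hdr); err != nil {
			log.Fatal(err)
		}
		if hdr.Typeflag == tar.TypeReg {
			if _, err := tw.Write(body); err != nil {
				log.Fatal(err)
			}
		}
	}
	if err := tw.Close(); err != nil {
		log.Fatal(err)
	}
	return &buf
}

func main() {
	counts, err := countLinks(sampleArchive())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("./bar.txt:", counts["./bar.txt"])
	fmt.Println("./xyzzy.txt:", counts["./xyzzy.txt"])
}
```

Note that the counts are keyed by the name of the first occurrence, so the hard link names themselves never appear in the map.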
Update
As the OP actually wanted to know how to manage hardlinks on the
source filesystem, here's the update.
The chief idea is that on a filesystem with POSIX semantics:
- A directory entry designating a file actually points to a special filesystem metadata block called "inode".
  The inode contains the number of directory entries pointing to it.
  Creating a hardlink is actually just:
  - Creating a new directory entry pointing to the inode of the original (source) file ("the link target" in `ln`'s terms).
  - Incrementing the link counter in that inode.
- Hence any file is uniquely identified by two integer numbers: the "device number" identifying the physical device hosting the filesystem on which the file is located, and the inode number identifying the file's data.
  It follows that if two files have the same (device, inode) pair, they represent the same content. Or, to put it differently, one is a hardlink to the other.
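To see this in action without touching platform-specific structures, here is a small sketch relying on `os.SameFile` from the standard library, which on Unix compares exactly the device and inode numbers of the two files. The helper names `sameFile` and `demo`, and the file names, are made up for illustration.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// sameFile reports whether two paths name the same underlying file.
func sameFile(a, b string) (bool, error) {
	fa, err := os.Lstat(a)
	if err != nil {
		return false, err
	}
	fb, err := os.Lstat(b)
	if err != nil {
		return false, err
	}
	return os.SameFile(fa, fb), nil
}

// demo creates a file, a hard link to it and an unrelated file with
// identical content in a temporary directory, then compares them.
func demo() (linked, unrelated bool, err error) {
	dir, err := os.MkdirTemp("", "hardlink-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	orig := filepath.Join(dir, "bar.txt")
	link := filepath.Join(dir, "foo.txt")
	other := filepath.Join(dir, "xyzzy.txt")

	if err := os.WriteFile(orig, []byte("test\n"), 0644); err != nil {
		return false, false, err
	}
	if err := os.Link(orig, link); err != nil {
		return false, false, err
	}
	if err := os.WriteFile(other, []byte("test\n"), 0644); err != nil {
		return false, false, err
	}

	linked, err = sameFile(orig, link)
	if err != nil {
		return false, false, err
	}
	unrelated, err = sameFile(orig, other)
	if err != nil {
		return false, false, err
	}
	return linked, unrelated, nil
}

func main() {
	linked, unrelated, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("bar.txt and foo.txt are the same file:", linked)
	fmt.Println("bar.txt and xyzzy.txt are the same file:", unrelated)
}
```

The unrelated file has byte-identical content and still compares as different, which is the point: identity is about the inode, not the data.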
So, adding files to a `tar` archive while preserving the hardlinks works this way:
1. Having added a file, save its (device, inode) pair to some lookup table.
2. When adding another file, figure out its (device, inode) pair and look it up in that table.
   If a matching entry is found, the file's data was already streamed, and we should add a hardlink.
   Otherwise, behave as in step (1).
So here's the code:
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path/filepath"
	"syscall"
)

// devino identifies a file by the device hosting its filesystem
// and by its inode number on that filesystem.
type devino struct {
	Dev uint64
	Ino uint64
}

func main() {
	log.SetFlags(0)

	if len(os.Args) != 2 {
		log.Fatalf("Usage: %s DIR\n", os.Args[0])
	}

	// Maps each (device, inode) pair already archived
	// to the name it was archived under.
	seen := make(map[devino]string)

	tw := tar.NewWriter(os.Stdout)

	err := filepath.Walk(os.Args[1],
		func(fn string, fi os.FileInfo, we error) (err error) {
			if we != nil {
				log.Fatal("Error processing directory: ", we)
			}

			hdr, err := tar.FileInfoHeader(fi, "")
			if err != nil {
				return
			}

			if fi.IsDir() {
				err = tw.WriteHeader(hdr)
				return
			}

			// POSIX-only: elsewhere Sys() returns a different type.
			st := fi.Sys().(*syscall.Stat_t)
			di := devino{
				Dev: uint64(st.Dev), // the field width varies between platforms
				Ino: uint64(st.Ino),
			}

			orig, ok := seen[di]
			if ok {
				// The data was already streamed under another name:
				// emit a header-only hard link entry.
				hdr.Typeflag = tar.TypeLink
				hdr.Linkname = orig
				hdr.Size = 0
				err = tw.WriteHeader(hdr)
				return
			}

			fd, err := os.Open(fn)
			if err != nil {
				return
			}
			err = tw.WriteHeader(hdr)
			if err != nil {
				fd.Close()
				return
			}
			_, err = io.Copy(tw, fd)
			fd.Close() // Ignoring error for a file opened R/O
			if err == nil {
				// Only the base name is recorded, matching hdr.Name.
				seen[di] = fi.Name()
			}
			return err
		})
	if err != nil {
		log.Fatal(err)
	}

	err = tw.Close()
	if err != nil {
		log.Fatal(err)
	}
}
Note that it's quite inadequate:
- It improperly deals with file and directory names.
- It does not attempt to properly work with symlinks and FIFOs, or to skip Unix-domain sockets etc.
- It assumes it works in a POSIX environment.
  On non-POSIX systems, the `Sys()` method called on a value of type `os.FileInfo` might return something other than the POSIX'y `syscall.Stat_t`.
  Say, on Windows, there are multiple filesystems hosted by different "disks" or "drives". I have no idea how Go handles that.
  Maybe the "device number" has to be emulated somehow for this case.
On the other hand, it shows how to handle hardlinks:
- Set the "Linkname" field of the header struct.
- Reset the "Size" field of the header to 0 (because no data will follow).
You might also want to use another approach to maintain the lookup table: if most of your files are expected to be located on the same physical filesystem, each entry wastes a `uint64` on a device number that is almost always the same. So a hierarchy of maps might be a sensible thing to do: the first maps device numbers to another map which maps inode numbers to file names.