add --nocopy does not always add blocks to the filestore + proposal to make it do so
#5988 opened on Feb 12, 2019
Description
Version information: 0.4.19-dev
Type: enhancement
Description:
add --nocopy should add files to the filestore instead of the blockstore, but it doesn't always do so. Blocks are not added to the filestore if the same blocks happen to be in the blockstore. In this case, the blocks won't end up in the filestore, and so won't be seen in filestore ls. This hinders the user from achieving non-duplicated storage of data: the user ends up with the data in the original files AND in the blockstore, despite having used --nocopy.
To avoid the duplication one might think to unpin the roots existing pior to the add and refering to the leaves that match the blocks that are about to be added, and then repo gc to clear these leaves from the blockstore, before running add --nocopy, but
GC does not always cleanup the blocks in the filestore. I've seen block X in the(UPDATE: Blocks were held by MFS:filestore ls,pin ls Xsaying 'not pinned', thenrepo gc, and yetfilestore ls Xstill finds it. I've also seen GC remove unpinned blocks from the filestore, so perhaps the former behavior is just a bug?ipfs files)- This might not be a general solution, because the leaves might be shared with unrelated roots R. As long as R is pinned, this is fine (I'm not looking to re-point R's leaf pointers to the filestore.), but after R is unpinned, we have the problematic duplication: data in raw filesystem file and same data in the blockstore. In the extreme case, this problem can make nocopy a noop (save net zero space after counting the raw filesystem files; instead of adding the negligible cost of intermediate nodes for the benefit of publishing the data to IPFS).
- This approach cannot be applied retroactively after the add -- if you realize that the blocks existed in the blockstore after the add, then you have to clean them and then re-run
add --nocopyagain. This is less intuitive than having to clean up duplicates after finding out that there are duplicates.
Proposal: allow the add to duplicate between filestore and blockstore, but provide tools to manage this. If user requests something be added to the filestore with --nocopy, then add it unconditionally, but also give them a block rm option to remove from blockstore or filestore explicitly. The removal would only be allowed if the block is not referenced by a pinned parent. This check already happens, but with the proposed change the references need to be distinguished by the backend type of the block, not only by the block hash, because obviously the block hash is going to be referenced by the parent that was just added with --nocopy.
Then, to get into a state with only one copy of the data (in the original filesystem files): ipfs filestore dups | xargs -n 1 ipfs block rm --from-blockstore.
Both parts seem doable. The patch to the first part (I'm already using it):
--- a/filestore/filestore.go
+++ b/filestore/filestore.go
@@ -169,15 +169,6 @@ func (f *Filestore) GetSize(c cid.Cid) (int, error) {
// Has returns true if the block with the given Cid is
// stored in the Filestore.
func (f *Filestore) Has(c cid.Cid) (bool, error) {
- has, err := f.bs.Has(c)
- if err != nil {
- return false, err
- }
-
- if has {
- return true, nil
- }
-
return f.fm.Has(c)
}