deepchem/deepchem

Error Saving and Restoring GraphConvModel

Open

#1,943 opened on Jul 2, 2020

Labels: bug, help wanted

Description

Here's a simple failing test script:

import numpy as np
import tempfile
import unittest
# Generate dummy dataset
np.random.seed(123)
import logging
logging.basicConfig(level=logging.INFO)
import deepchem as dc

from rdkit import Chem
smiles = ["C"]
mols = [Chem.MolFromSmiles(smile) for smile in smiles]
feat = dc.feat.ConvMolFeaturizer()
X = feat.featurize(mols)
y = np.random.rand(1, 1)
train_dataset = dc.data.NumpyDataset(X, y, ids=smiles)

transformers = []
metric = dc.metrics.Metric(
    dc.metrics.mean_squared_error, task_averager=np.mean)

# Train for one epoch and score on the training set
model = dc.models.GraphConvModel(n_tasks=1, mode="regression", dropout=0.5, model_dir="./graphconv_model")
model.fit(train_dataset, nb_epoch=1)
orig_score = model.evaluate(train_dataset, [metric], transformers)
print("orig_score")
print(orig_score)

# Rebuild the model with the same arguments and restore from model_dir
new_model = dc.models.GraphConvModel(n_tasks=1, mode="regression", dropout=0.5, model_dir="./graphconv_model")
new_model.restore()

new_score = new_model.evaluate(train_dataset, [metric], transformers)
print("new_score")
print(new_score)

# The scores should match if the restore worked
assert orig_score == new_score

Here are the printouts that result from this script:

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/indexed_slices.py:434: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "                                                                       
INFO:deepchem.models.keras_model:Ending global_step 1: Average loss 0.00475866
INFO:deepchem.models.keras_model:TIMING: model fitting took 5.224 s
INFO:deepchem.metrics:computed_metrics: [0.5235924904790688]
orig_score
{'mean-mean_squared_error': 0.5235924904790688}
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. Either the Trackable object references in the Python program have changed in an incompatible way, or the checkpoint was generated in an incompatible program.

Two checkpoint references resolved to different objects (<tf.Variable 'private__graph_conv_keras_model_1/graph_conv_2/kernel:0' shape=(75, 64) dtype=float32, numpy=
array([[ 4.4995248e-03,  6.8631753e-02, -8.2775950e-05, ...,
         9.0061918e-02, -5.3373262e-02,  1.2269877e-01],
       [ 9.3627259e-02,  6.1479136e-02,  1.3002183e-01, ...,
        -4.3840557e-03, -8.5053414e-02,  3.7676141e-02],
       [-1.6051233e-01,  6.1249837e-02,  1.3138755e-01, ...,
         1.6568883e-01, -1.2686929e-01, -1.7409901e-01],
       ...,
       [ 1.3316955e-01,  9.1872528e-02,  7.7954307e-02, ...,
        -1.8381983e-02, -1.9298962e-01, -8.3953649e-02],
       [ 9.0767816e-02,  1.6689555e-01,  2.8418824e-02, ...,
        -1.8065323e-01,  5.5107340e-02, -1.9644000e-01],
       [-1.4626555e-01, -6.9499269e-02,  9.4810143e-02, ...,
         1.6276342e-01, -3.9090957e-02,  2.0000762e-01]], dtype=float32)> and <tf.Variable 'private__graph_conv_keras_model_1/graph_conv_2/kernel:0' shape=(75, 64) dtype=float32, numpy=
array([[ 0.01428294,  0.12348534,  0.16362987, ...,  0.14605765,
         0.1838846 ,  0.02513948],
       [ 0.08537485,  0.0934632 , -0.02142404, ...,  0.13571231,
        -0.11089277, -0.02331042],
       [-0.03228797,  0.04885544,  0.04713757, ...,  0.13593303,
        -0.01467833,  0.00903808],
       ...,
       [ 0.18738167, -0.17059079,  0.0418079 , ..., -0.01394288,
         0.13641016, -0.05478275],
       [ 0.05549066, -0.18584336,  0.06955038, ..., -0.14905915,
        -0.14295515, -0.08938611],
       [-0.08604956, -0.08486389,  0.11554165, ...,  0.11536823,
         0.17340572,  0.03584954]], dtype=float32)>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. Either the Trackable object references in the Python program have changed in an incompatible way, or the checkpoint was generated in an incompatible program.

Two checkpoint references resolved to different objects (<tf.Variable 'private__graph_conv_keras_model_1/graph_conv_2/bias:0' shape=(64,) dtype=float32, numpy=
array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        , 
        0.        ,  0.        ,  0.00097008,  0.        , -0.00099363, 
       -0.00097946, -0.00098862,  0.00097581,  0.        ,  0.        , 
       -0.0008924 ,  0.        ,  0.        ,  0.        ,  0.        , 
        0.        ,  0.00082348,  0.00078612,  0.        ,  0.        , 
        0.00090047,  0.0008615 ,  0.        ,  0.        ,  0.        , 
        0.        ,  0.        , -0.00097074,  0.        , -0.00073872, 
       -0.0008829 , -0.00096245, -0.00087156, -0.00089219,  0.        , 
       -0.00089659, -0.00081582,  0.        ,  0.        , -0.00094864, 
       -0.00083577,  0.        ,  0.00091346,  0.        ,  0.00075098, 
        0.00093225, -0.00094252, -0.00094864,  0.        ,  0.        , 
        0.        ,  0.0009519 ,  0.        ,  0.        , -0.00093958, 
        0.        , -0.00095233,  0.00015859,  0.000994  ], dtype=float32)> and <tf.Variable 'private__graph_conv_keras_model_1/graph_conv_2/bias:0' shape=(64,) dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. Either the Trackable object references in the Python program have changed in an incompatible way, or the checkpoint was generated in an incompatible program.

Two checkpoint references resolved to different objects (<tf.Variable 'private__graph_conv_keras_model_1/graph_conv_3/kernel:0' shape=(64, 64) dtype=float32, numpy=
array([[-0.08082333, -0.14572376, -0.19245544, ..., -0.20733352,
        -0.0442199 , -0.11743415],
       [ 0.07426409,  0.1026492 ,  0.02717344, ...,  0.02840988,
        -0.15217997,  0.02984542],
       [-0.11081054,  0.06778781, -0.20188095, ..., -0.0791709 ,
        -0.17596933, -0.20975627],
       ...,
       [-0.14444749, -0.04584648, -0.03178209, ...,  0.10852   ,
        -0.00810872,  0.05565516],
       [ 0.08839096,  0.21500859,  0.06311794, ...,  0.1613014 ,
        -0.2150805 , -0.15202223],
       [ 0.14957619,  0.21731606,  0.21284355, ..., -0.08467104,
         0.08962891,  0.1785949 ]], dtype=float32)> and <tf.Variable 'private__graph_conv_keras_model_1/graph_conv_3/kernel:0' shape=(64, 64) dtype=float32, numpy=
array([[-0.1336613 , -0.1270378 , -0.0345715 , ..., -0.06275596,
         0.1497819 ,  0.01383078],
       [-0.18531801,  0.18253906,  0.02693482, ...,  0.05082844,
        -0.17340271,  0.15489809],
       [-0.0875622 ,  0.20997234, -0.12924384, ...,  0.1812403 ,
         0.19916232,  0.047437  ],
       ...,
       [-0.01278952, -0.20331962, -0.05146629, ...,  0.00744121,
        -0.02356011, -0.14106297],
       [ 0.02996522,  0.07338057, -0.08159059, ..., -0.10143289,
         0.0085362 ,  0.11073922],
       [-0.1267096 ,  0.06842761, -0.14939544, ...,  0.08098547,
        -0.14479446, -0.17727807]], dtype=float32)>).
WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. Either the Trackable object references in the Python program have changed in an incompatible way, or the checkpoint was generated in an incompatible program.

Two checkpoint references resolved to different objects (<tf.Variable 'private__graph_conv_keras_model_1/graph_conv_3/bias:0' shape=(64,) dtype=float32, numpy=
array([ 0.001     ,  0.001     , -0.001     ,  0.00099999,  0.001     , 
        0.001     , -0.001     , -0.00099998,  0.001     ,  0.00010065, 
        0.001     , -0.001     , -0.00099999,  0.001     ,  0.001     , 
        0.001     ,  0.00099999,  0.001     ,  0.001     ,  0.001     , 
        0.001     , -0.001     , -0.00099999, -0.001     , -0.001     , 
        0.00099999,  0.001     , -0.001     , -0.001     ,  0.001     , 
        0.001     ,  0.001     ,  0.001     ,  0.001     ,  0.001     , 
        0.00099999,  0.001     , -0.001     ,  0.00099995,  0.001     , 
        0.001     , -0.00099996,  0.001     , -0.00099995, -0.001     , 
        0.001     , -0.001     , -0.001     , -0.001     ,  0.001     , 
        0.001     ,  0.001     , -0.00099999, -0.001     , -0.001     , 
        0.00099999,  0.001     , -0.00099998,  0.001     ,  0.00099999, 
        0.001     ,  0.001     , -0.001     , -0.001     ], dtype=float32)> and <tf.Variable 'private__graph_conv_keras_model_1/graph_conv_3/bias:0' shape=(64,) dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>).
INFO:deepchem.metrics:computed_metrics: [0.3948746252299107]
new_score
{'mean-mean_squared_error': 0.3948746252299107}
Traceback (most recent call last):
  File "graphconv_save_and_restore.py", line 36, in <module>
    assert orig_score == new_score
AssertionError
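
One caveat on the check itself: `assert orig_score == new_score` compares floats exactly, so it could also trip on harmless round-off. That is not what happens here (0.5236 vs 0.3949 is nowhere near round-off), but a tolerance-based comparison would make the test more robust. A minimal sketch; `scores_close` is a hypothetical helper, not a DeepChem function, and the tolerances are arbitrary:

```python
import math


def scores_close(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if both score dicts have the same keys and nearly equal values."""
    if a.keys() != b.keys():
        return False
    return all(
        math.isclose(a[k], b[k], rel_tol=rel_tol, abs_tol=abs_tol) for k in a
    )


# The two score dicts printed in the output above.
orig = {"mean-mean_squared_error": 0.5235924904790688}
new = {"mean-mean_squared_error": 0.3948746252299107}

print(scores_close(orig, orig))
print(scores_close(orig, new))
```

With this helper the test would still fail on the scores above, which is the point: the mismatch is real and not a float-comparison artifact.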

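I came across this while debugging https://github.com/deepchem/deepchem/pull/1878. I'm fairly sure that this is a bug in our save/restore code in general. The restored model scores 0.3949 where the original scored 0.5236, and TensorFlow warns about inconsistent checkpoint references for the graph_conv_2 and graph_conv_3 kernels and biases, so it looks like the restore isn't loading the trained weights properly.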
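
One way to narrow this down might be to diff the weights of the trained and restored models directly, rather than only comparing scores. A rough sketch, using plain NumPy lists so it runs on its own; `diff_weights` is a hypothetical helper and the arrays below are stand-ins, not real model weights:

```python
import numpy as np


def diff_weights(weights_a, weights_b, names=None, atol=1e-6):
    """Return (name, max_abs_diff) for every pair of arrays that differ.

    weights_a and weights_b are lists of NumPy arrays in the same order,
    e.g. what a Keras model's get_weights() returns.
    """
    if len(weights_a) != len(weights_b):
        raise ValueError("models have a different number of weight arrays")
    if names is None:
        names = [f"weight_{i}" for i in range(len(weights_a))]
    mismatches = []
    for name, a, b in zip(names, weights_a, weights_b):
        a, b = np.asarray(a), np.asarray(b)
        if a.shape != b.shape:
            # Shapes disagree, so report an infinite difference.
            mismatches.append((name, float("inf")))
        elif not np.allclose(a, b, atol=atol):
            mismatches.append((name, float(np.max(np.abs(a - b)))))
    return mismatches


# Stand-in arrays: the second "bias" mimics a layer that stayed at its
# zero initialisation instead of picking up the trained values.
trained = [np.full((2, 2), 0.5), np.array([0.001, -0.001])]
restored = [np.full((2, 2), 0.5), np.zeros(2)]

print(diff_weights(trained, restored, names=["kernel", "bias"]))
```

In the script above this would presumably be called with `model.model.get_weights()` and `new_model.model.get_weights()`, assuming the wrapped Keras model is exposed as `.model`. Any layer reported here would be one whose weights did not survive the save/restore round trip.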

I'm a little stumped at this point. Do any of you folks have ideas on what could be behind this?

CC @peastman @vsomnath
