# LitMatter DeepChem
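* This notebook shows how to use the LitMatter template to accelerate [DeepChem](https://github.com/deepchem/deepchem) model training on [MoleculeNet](https://arxiv.org/abs/1703.00564) datasets.
* In this example, we train a simple DeepChem `TorchModel` on the Tox21 dataset.
* The training workflow shown here can be scaled to hundreds of GPUs by changing a single keyword argument!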

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import torch

In [3]:
import deepchem as dc

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning import (LightningDataModule, LightningModule, Trainer,
                               seed_everything)

### Load a `LitMolNet` dataset
Any MolNet dataset from `deepchem.molnet` can be used with LitMatter. The specific MolNet dataset and any preprocessing steps can be defined in `data.LitMolNet`.

In [4]:
from lit_data.molnet_data import LitMolNet

dm = LitMolNet(loader=dc.molnet.load_tox21, batch_size=16)
dm.prepare_data()
dm.setup()
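
The `batch_size=16` passed to `LitMolNet` determines how many iterations make up one training epoch. The sketch below relates the two using plain arithmetic. The helper names are hypothetical, and the training-set size of 6,264 molecules is an assumption (80% of the 7,831 compounds in Tox21), not something this notebook states. It also assumes the final partial batch is kept.

```python
import math


def batches_per_epoch(num_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a data loader yields for one pass over the data."""
    if drop_last:
        # The trailing partial batch is discarded.
        return num_examples // batch_size
    # The trailing partial batch is kept.
    return math.ceil(num_examples / batch_size)


def example_count_bounds(num_batches: int, batch_size: int) -> tuple:
    """Smallest and largest dataset sizes that yield `num_batches`
    when the trailing partial batch is kept."""
    return (num_batches - 1) * batch_size + 1, num_batches * batch_size


# Assumption: 6,264 training molecules (80% of 7,831); not stated in the notebook.
assumed_train_size = 6264
print(batches_per_epoch(assumed_train_size, 16))
print(example_count_bounds(392, 16))
```

Under these assumptions the result is consistent with the 392 training iterations per epoch shown in the training log further down.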



### Instantiate a `LitDeepChem` model
Any `deepchem.models.torch_models.TorchModel` can be used with LitMatter. Here, we write our own custom base model in PyTorch and wrap it in a `TorchModel`.

In [9]:
from lit_models.deepchem_models import LitDeepChem

base_model = torch.nn.Sequential(
    torch.nn.Linear(1024, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 12),
)

torch_model = dc.models.TorchModel(base_model, loss=torch.nn.MSELoss())

model = LitDeepChem(torch_model, lr=1e-2)

Exception ignored in: <function Model.__del__ at 0x7f028f6c6550>
Traceback (most recent call last):
  File "/home/gridsan/NA30490/.conda/envs/litmatter/lib/python3.8/site-packages/deepchem/models/models.py", line 61, in __del__
    shutil.rmtree(self.model_dir)
  File "/home/gridsan/NA30490/.conda/envs/litmatter/lib/python3.8/shutil.py", line 709, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/home/gridsan/NA30490/.conda/envs/litmatter/lib/python3.8/shutil.py", line 707, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/state/partition1/slurm_tmp/48690281.0.0/tmp60c4fwyb'
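
The base model is small enough to count its parameters by hand, which gives a quick sanity check against the model summary that Lightning prints during training. The sketch below is plain arithmetic. The helper name is hypothetical, and the size estimate assumes 32-bit (4-byte) floating-point parameters.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a fully connected layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


# (in_features, out_features) of the two Linear layers in `base_model`;
# ReLU has no parameters.
layer_shapes = [(1024, 256), (256, 12)]

total = sum(linear_params(i, o) for i, o in layer_shapes)

# Assumption: parameters are stored as 4-byte float32 values.
size_mb = total * 4 / 1e6

print(total)              # 265484
print(round(size_mb, 3))  # 1.062
```

Both numbers agree with the "265 K" total parameters and the 1.062 MB estimated size reported in the training output.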


### Train the model
To train on multiple GPUs and multiple nodes, just change the `Trainer` flags as needed.

In [11]:
trainer = Trainer(gpus=-1,  # use all available GPUs on each node
                  # num_nodes=1,  # change to number of available nodes
                  # accelerator='ddp',
                  max_epochs=5,
                  )

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


In [12]:
trainer.fit(model, datamodule=dm)

  rank_zero_deprecation(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-13636255-d3c9-b0ac-83c7-b25c82e0dbc5]
Set SLURM handle signals.

  | Name    | Type       | Params
---------------------------------------
0 | model   | Sequential | 265 K 
1 | loss_fn | MSELoss    | 0     
---------------------------------------
265 K     Trainable params
0         Non-trainable params
265 K     Total params
1.062     Total estimated model params size (MB)
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")


Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]
Epoch 0: : 392it [00:01, 225.99it/s, loss=14.3, v_num=4.87e+7]
Epoch 0: : 442it [00:01, 223.67it/s, loss=14.3, v_num=4.87e+7, val_loss=19.20]
Epoch 1: : 392it [00:01, 233.84it/s, loss=9.79, v_num=4.87e+7, val_loss=19.20, train_loss=13.20]
Epoch 1: : 442it [00:01, 235.14it/s, loss=9.79, v_num=4.87e+7, val_loss=19.00, train_loss=13.20]
Epoch 2: : 392it [00:01, 264.65it/s, loss=13.7, v_num=4.87e+7, val_loss=19.00, train_loss=11.00]
Epoch 2: : 442it [00:01, 260.55it/s, loss=13.7, v_num=4.87e+7, val_loss=18.90, train_loss=11.00]
Epoch 3: : 392it [00:01, 261.71it/s, loss=7.03, v_num=4.87e+7, val_loss=18.90, train_loss=8.930]
Epoch 3: : 442it [00:01, 259.93it/s, loss=7.03, v_num=4.87e+7, val_loss=19.30, train_loss=8.930]
Epoch 4: : 392it [00:01, 268.69it/s, loss=6.4, v_num=4.87e+7, val_loss=19.30, train_loss=7.330]
Epoch 4: : 442it [00:01, 267.11it/s, loss=6.4, v_

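That's it! By changing the `num_nodes` argument, training can be distributed across all available GPUs. For longer training jobs on an HPC cluster, see the provided example batch scripts.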
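
As a rough illustration of what changes when scaling out, the sketch below builds the keyword arguments for `Trainer` from a node count and computes the resulting global batch size. Both helpers are hypothetical and not part of LitMatter. The flag names (`gpus`, `num_nodes`, `accelerator='ddp'`) mirror the ones used in this notebook, and newer PyTorch Lightning releases may expect different names. With distributed data parallel training each process draws its own batch, so the global batch size grows with the number of devices.

```python
def trainer_flags(num_nodes: int = 1, gpus_per_node: int = -1, max_epochs: int = 5) -> dict:
    """Hypothetical helper: keyword arguments for `Trainer`, using the flag
    names that appear in this notebook."""
    flags = {"gpus": gpus_per_node, "max_epochs": max_epochs}
    if num_nodes > 1:
        # Multi-node runs enable the two flags that are commented out above.
        flags["num_nodes"] = num_nodes
        flags["accelerator"] = "ddp"
    return flags


def global_batch_size(per_device_batch: int, gpus_per_node: int, num_nodes: int) -> int:
    """Examples consumed per optimizer step when every device
    processes its own batch (data-parallel training)."""
    return per_device_batch * gpus_per_node * num_nodes


print(trainer_flags())             # single node, all visible GPUs
print(trainer_flags(num_nodes=4))  # four nodes
print(global_batch_size(16, 2, 4)) # batch_size=16 on 4 nodes with 2 GPUs each
```

The dictionary could then be passed as `Trainer(**trainer_flags(num_nodes=4))`. Since the global batch size scales with the device count, the learning rate may need retuning for large runs.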