[wip] comparing vanilla torch model and clipping with OLMo FSDP, no_shard and OLMo clipping #577

ananyahjha93 · 2024-05-13T22:06:28Z

There is an order of magnitude difference between the losses between the two setups. @dirkgr @epwalsh can you sanity check the OLMo grad_clipping code for FSDP no_shard/DDP?

On 3 batches being sent to the model again and again, the model should overfit and the loss should go to 0. We can see that in the screenshot for vanilla pytorch run.

Comparing the two runs:

OLMo model used in both
no FSDP vs FSDP-no_shard (32 bit)
torch optim and clipping vs OLMo optim and clipping
OLMo scheduler used in both

dirkgr · 2024-05-13T22:08:06Z

Can you make this a draft PR? I don't think we'll merge this?

dirkgr · 2024-05-13T22:08:39Z

no FSDP vs FSDP-no_shard (32 bit)
torch optim and clipping vs OLMo optim and clipping

Find out which one of these makes the difference?

dirkgr · 2024-05-13T22:12:45Z

tests/grad_norm_test.py

+            reduce_dtype=torch.float32,
+            buffer_dtype=torch.float32,
+        ),
+        auto_wrap_policy=None,


Did you check how it wraps the model?

dirkgr · 2024-05-13T22:13:30Z

tests/grad_norm_test.py

+    # use same model, data, optimizer, fsdp_model and send to trainer and compare gradient clip
+
+    # olmo optimizer
+    model = OLMo(cfg.model).to('cuda')


Can you make double sure the initialization is same here?

From the looks of it, this model will definitely start with different weights than the one above. The best way to be sure would be to load starting state dict of the first model into the second model.

dirkgr · 2024-05-13T22:15:27Z

tests/grad_norm_test.py

+    # olmo optimizer
+    model = OLMo(cfg.model).to('cuda')
+    olmo_optimizer = build_optimizer(cfg, model)
+    data_loader = build_train_dataloader(cfg)


Make sure the data loader is same too? By the time you get here you might have consumed some random state, so you might get different data. I would go so far as to pre-load the data into a List, and then use that, so you can be 100% sure it's the same.

epwalsh · 2024-05-13T22:45:21Z

tests/grad_norm_test.py

+    # use same model, data, optimizer, fsdp_model and send to trainer and compare gradient clip
+
+    # olmo optimizer
+    model = OLMo(cfg.model).to('cuda')


From the looks of it, this model will definitely start with different weights than the one above. The best way to be sure would be to load starting state dict of the first model into the second model.

epwalsh · 2024-05-13T22:47:27Z

olmo/optim.py

        # Now reduce metrics over all ranks.
        total_grad_norm: torch.Tensor
        per_param_avg_metrics: List[torch.Tensor] = []
-        if is_distributed():  # TODO (epwalsh): skip for non-sharded params
+        if is_distributed() and param_group_sharded:


Will still need to reduce gradient metrics with non-sharded params since each rank will have different gradients.

won't the gradients sync after loss.backward() call?

Ah, yes you're right. My mistake

in DDP loss.backward() syncs grads, optimizer.step() updates each copy with synced grads! Or am I getting this wrong? So, if loss.backward() syncs grads, then every rank must have the same grads

@epwalsh does this apply to FSDP no_shard as well?

ananyahjha93 · 2024-05-13T22:54:39Z

Sending torch optimizer and gradient clipping with no_shard seems to fix it!

no_shard_fix branch

d29fe27

ananyahjha93 requested a review from epwalsh May 13, 2024 22:06

dirkgr reviewed May 13, 2024

View reviewed changes

ananyahjha93 marked this pull request as draft May 13, 2024 22:13

dirkgr reviewed May 13, 2024

View reviewed changes

epwalsh reviewed May 13, 2024

View reviewed changes

ananyahjha93 added 7 commits May 13, 2024 21:29

.

960f0f2

model init

f687c0a

added torch clip to olmo train code

1433d52

updated olmo clipping

e640b79

parallel training of models

94963a8

current

ddad073

.

4d31ac8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[wip] comparing vanilla torch model and clipping with OLMo FSDP, no_shard and OLMo clipping #577

[wip] comparing vanilla torch model and clipping with OLMo FSDP, no_shard and OLMo clipping #577

ananyahjha93 commented May 13, 2024

dirkgr commented May 13, 2024

dirkgr commented May 13, 2024

dirkgr May 13, 2024

dirkgr May 13, 2024

epwalsh May 13, 2024

dirkgr May 13, 2024

epwalsh May 13, 2024

epwalsh May 13, 2024

ananyahjha93 May 13, 2024

epwalsh May 13, 2024

ananyahjha93 May 13, 2024

ananyahjha93 May 13, 2024

ananyahjha93 commented May 13, 2024

[wip] comparing vanilla torch model and clipping with OLMo FSDP, no_shard and OLMo clipping #577

Are you sure you want to change the base?

[wip] comparing vanilla torch model and clipping with OLMo FSDP, no_shard and OLMo clipping #577

Conversation

ananyahjha93 commented May 13, 2024

dirkgr commented May 13, 2024

dirkgr commented May 13, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ananyahjha93 commented May 13, 2024