Fix minor grammatical errors in docs #1181

Open · wants to merge 2 commits into main
docs/source/deep_dive/oss_sdp_fsdp.rst (2 changes: 1 addition & 1 deletion)
@@ -7,7 +7,7 @@ that aim to tackle the tradeoff between using Data Parallel training and Model P
 When using Data Parallel training, you tradeoff memory for computation/communication efficiency.
 On the other hand, when using Model Parallel training, you tradeoff computation/communication
 efficiency for memory. ZeRO attempts to solve this problem. Model training generally involves memory
-footprints that falls into two categories:
+footprints that fall into two categories:
 
 1. Model states - optimizer states, gradients, parameters
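The context above describes the ZeRO tradeoff these docs cover. As a minimal sketch of how fairscale handles the "model states" category, its OSS wrapper shards optimizer state across data-parallel ranks; the process-group backend, model, and hyperparameters below are hypothetical, and the script is assumed to run under a distributed launcher.

import torch
import torch.distributed as dist
import torch.nn as nn
from fairscale.optim.oss import OSS

# Assumes launch via a distributed launcher (e.g. torchrun); the backend
# choice here is illustrative.
dist.init_process_group(backend="nccl")

# Hypothetical model; in practice this is the model being trained.
model = nn.Linear(1024, 1024).cuda()

# OSS wraps a regular optimizer and shards its state (one of the "model
# states" listed above) across the data-parallel ranks, so each rank stores
# only its own slice instead of a full replica.
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.01)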
docs/source/deep_dive/pipeline_parallelism.rst (2 changes: 1 addition & 1 deletion)
@@ -12,7 +12,7 @@ Gpipe first shards the model across different devices where each device hosts a
 A shard can be a single layer or a series of layers. However Gpipe splits a mini-batch of data into
 micro-batches and feeds it to the device hosting the first shard. The layers on each device process
 the micro-batches and send the output to the following shard/device. In the meantime it is ready to
-process the micro batch from the previous shard/device. By pipepling the input in this way, Gpipe is
+process the micro batch from the previous shard/device. By pipelining the input in this way, Gpipe is
 able to reduce the idle time of devices.
 
 Best practices for using `fairscale.nn.Pipe`
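As an illustrative sketch of the micro-batching scheme described in the context above, assuming at least two available CUDA devices; the layer sizes, balance, and chunk count are hypothetical, not a prescribed configuration.

import torch
import torch.nn as nn
from fairscale.nn import Pipe

# Hypothetical four-layer model; Pipe expects an nn.Sequential so it can be
# split into contiguous shards.
model = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)

# balance=[2, 2] places two layers on each of two devices, and chunks=4
# splits each mini-batch into four micro-batches that flow through the
# shards in a pipeline, reducing device idle time as described above.
model = Pipe(model, balance=[2, 2], chunks=4)

# The input must live on the device hosting the first shard.
output = model(torch.rand(32, 256).cuda(0))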