1. Prepare cluster configuration file
A `clusterConfig.json` file should be ready before starting servers. An example is in `clusterConfigExample.json`.
- Upload this configuration file to the server that will serve as the cluster coordinator.
- If you use `genConfigForAwsExperiment.py`, this configuration file will be generated and uploaded automatically to all EC2 servers.
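The real schema is whatever `clusterConfigExample.json` contains; the fragment below is only a hypothetical illustration of the kind of information such a file typically carries (every field name here is an assumption, not the actual schema — consult the bundled example file):

```json
{
  "workers": [
    {"addr": "172.31.0.11", "port": 11140, "device": "cuda:0"},
    {"addr": "172.31.0.12", "port": 11140, "device": "cuda:0"}
  ]
}
```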
2. Start cluster via cluster.py
- Single-line command:
  `python3 cluster.py --addrToBind <this_server's_addr>:<port_to_listen> --c10dBackend nccl`
- The cluster coordinator will listen on `<this_server's_addr>:<port_to_listen>`. Cluster clients will contact this address and port to submit training jobs. On AWS, make sure this is a private IP, not a public IP.
- If you use `genConfigForAwsExperiment.py`, you may copy and paste the last line of its stdout.
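A minimal sketch of assembling the start command. The IP and port values here are placeholder assumptions (the doc does not fix a port); the command is only echoed so you can inspect it before running it on the coordinator:

```shell
#!/bin/sh
# Assumption: replace ADDR with this server's private IPv4.
# On Linux you can usually discover it with: hostname -I | awk '{print $1}'
ADDR=172.31.0.11
PORT=11234   # assumption: any free port works; pick your own
CMD="python3 cluster.py --addrToBind ${ADDR}:${PORT} --c10dBackend nccl"
echo "$CMD"  # echoed for illustration; run it on the coordinator machine
```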
3. How to check system logs
- ssh to the machine that ran `cluster.py`.
- Logs are in `~/DeepPoolRuntime/logs/`.
- I typically use `grep "" logs/*.out` and `grep "" logs/*.err` to check how things are going.
4. Submit VGG12 training job to the cluster coordinator
- ssh to any machine that can reach the cluster coordinator.
- Run `python3 ~/DeepPoolRuntime/examples/vgg.py`.
- Right now, runtimes will only run 1 iteration.
- Scripts
- genConfigForAwsExperiment.py
  - `genConfigForAwsExperiment.py` needs two files: `aws-started-publicDnsName.txt` and `aws-started-privateIps.txt`. They can be generated automatically by `aws_ec2_tools/startEC2instance.sh`.
- aws_ec2_tools
  - Prerequisites
    - aws-cli2 with text output mode.
    - A security group that opens all ports within the group.
    - A private key registered in AWS.
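A quick sanity check before running `genConfigForAwsExperiment.py` might look like this. The file format (one entry per line) and the working directory are assumptions; the files here are fabricated so the sketch is self-contained, while `startEC2instance.sh` writes the real ones:

```shell
#!/bin/sh
DEMO=$(mktemp -d)
cd "$DEMO"
# Fabricated inputs (assumed format: one entry per line).
printf 'ec2-203-0-113-5.compute-1.amazonaws.com\n' > aws-started-publicDnsName.txt
printf '172.31.0.11\n' > aws-started-privateIps.txt
# Verify both input files exist and are non-empty before generating configs.
for f in aws-started-publicDnsName.txt aws-started-privateIps.txt; do
  [ -s "$f" ] || { echo "missing $f: run aws_ec2_tools/startEC2instance.sh first" >&2; exit 1; }
done
echo "genConfigForAwsExperiment.py inputs present"
```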
- be_training
  - BE training requires a patched pytorch. Run `build_custom_pytorch.sh` in the `be_training` directory to download, patch, compile, and install pytorch.
  - Build and install the training extension by running `python setup.py install` in the same directory.
  - Control batch training using the `--be_batch_size=N` flag for `runtime.py` (0 disables training, 16 is the default).
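A sketch of how the flag composes with `runtime.py`. Any other arguments `runtime.py` normally takes are omitted here as assumptions; the commands are only echoed for inspection:

```shell
#!/bin/sh
# 16 is the documented default BE batch size; 0 disables BE training.
CMD_DEFAULT="python3 runtime.py --be_batch_size=16"
CMD_OFF="python3 runtime.py --be_batch_size=0"
echo "$CMD_DEFAULT"
echo "$CMD_OFF"
```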