1. Prepare cluster configuration file
A `clusterConfig.json` file should be ready before starting servers. An example is in `clusterConfigExample.json`.
- Upload this configuration file to the server that will serve as the cluster coordinator.
- If you use `genConfigForAwsExperiment.py`, this configuration file will be generated and uploaded automatically to all EC2 servers.
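The real schema is whatever `clusterConfigExample.json` contains; the fragment below is only a hypothetical illustration of the kind of information such a file typically carries (every field name here is an assumption, not the actual schema — consult the bundled example file):

```json
{
  "workers": [
    {"addr": "172.31.0.11", "port": 11140, "device": "cuda:0"},
    {"addr": "172.31.0.12", "port": 11140, "device": "cuda:0"}
  ]
}
```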
2. Start cluster via cluster.py
- Single-line command:
  `python3 cluster.py --addrToBind <this_server's_addr>:<port_to_listen> --c10dBackend nccl`
- The cluster coordinator will listen on `<this_server's_addr>:<port_to_listen>`. Cluster clients will contact this address and port to submit training jobs. On AWS, make sure this is a private IP, not a public IP.
- If you use `genConfigForAwsExperiment.py`, you may copy and paste the last line of its stdout.
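A minimal sketch of assembling the start command. The IP and port values here are placeholder assumptions (the doc does not fix a port); the command is only echoed so you can inspect it before running it on the coordinator:

```shell
#!/bin/sh
# Assumption: replace ADDR with this server's private IPv4.
# On Linux you can usually discover it with: hostname -I | awk '{print $1}'
ADDR=172.31.0.11
PORT=11234   # assumption: any free port works; pick your own
CMD="python3 cluster.py --addrToBind ${ADDR}:${PORT} --c10dBackend nccl"
echo "$CMD"  # echoed for illustration; run it on the coordinator machine
```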
3. How to check system logs
- ssh to the machine that ran `cluster.py`.
- Logs are in `~/DeepPoolRuntime/logs/`.
- I typically use `grep "" logs/*.out` and `grep "" logs/*.err` to check how things are going.
4. Submit VGG12 training job to the cluster coordinator
- ssh to any machine that can reach the cluster coordinator.
- Run `python3 ~/DeepPoolRuntime/examples/vgg.py`.
- Right now, runtimes will only run 1 iteration.
- Scripts
- genConfigForAwsExperiment.py
  - `genConfigForAwsExperiment.py` needs two files: `aws-started-publicDnsName.txt` and `aws-started-privateIps.txt`. They can be generated automatically by `aws_ec2_tools/startEC2instance.sh`.
- aws_ec2_tools
  - Prerequisites
    - aws-cli2 with text output mode.
    - A security group that opens all ports within the group.
    - A private key registered in AWS.
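A quick sanity check before running `genConfigForAwsExperiment.py` might look like this. The file format (one entry per line) and the working directory are assumptions; the files here are fabricated so the sketch is self-contained, while `startEC2instance.sh` writes the real ones:

```shell
#!/bin/sh
DEMO=$(mktemp -d)
cd "$DEMO"
# Fabricated inputs (assumed format: one entry per line).
printf 'ec2-203-0-113-5.compute-1.amazonaws.com\n' > aws-started-publicDnsName.txt
printf '172.31.0.11\n' > aws-started-privateIps.txt
# Verify both input files exist and are non-empty before generating configs.
for f in aws-started-publicDnsName.txt aws-started-privateIps.txt; do
  [ -s "$f" ] || { echo "missing $f: run aws_ec2_tools/startEC2instance.sh first" >&2; exit 1; }
done
echo "genConfigForAwsExperiment.py inputs present"
```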
- be_training
  - BE training requires a patched pytorch. Run `build_custom_pytorch.sh` in the `be_training` directory to download, patch, compile, and install pytorch.
  - Build and install the training extension by running `python setup.py install` in the same directory.
  - Control batch training using the `--be_batch_size=N` flag for `runtime.py` (0 disables training, 16 is the default).
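A sketch of how the flag composes with `runtime.py`. Any other arguments `runtime.py` normally takes are omitted here as assumptions; the commands are only echoed for inspection:

```shell
#!/bin/sh
# 16 is the documented default BE batch size; 0 disables BE training.
CMD_DEFAULT="python3 runtime.py --be_batch_size=16"
CMD_OFF="python3 runtime.py --be_batch_size=0"
echo "$CMD_DEFAULT"
echo "$CMD_OFF"
```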