Description
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): sklearn
- Framework Version:
- Python Version: 3.5
- CPU or GPU: CPU
- Python SDK Version:
- Are you using a custom image: No
Describe the problem
I am trying to deploy a logistic regression model with the SageMaker scikit-learn estimator. When I train on 1/10 of the data, I can deploy without problems using the commands below. When I train on all the data, training succeeds and the model is around 800 MB, but deployment fails with these errors:
Minimal repro / logs
"in the jupyter notebook"
ValueError: Error hosting endpoint sagemaker-scikit-learn-2019-01-17-12-59-16-371: Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.
"in the CloudWatch console"
2019/01/17 14:29:00 [error] 25#25: *47 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.32.0.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock/ping", host: "model.aws.local:8080"
- Exact command to reproduce:
from sagemaker.sklearn.estimator import SKLearn

script_path = 'sklearn_sentiment.py'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type="ml.m4.4xlarge",
    sagemaker_session=sagemaker_session)

sklearn_preprocessor.fit({'train': data_location})

predictor = sklearn_preprocessor.deploy(initial_instance_count=1, instance_type="ml.c5.4xlarge")
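
The CloudWatch log above shows the `GET /ping` request timing out. One possible cause, not confirmed here, is that loading the 800 MB model takes too long. A rough way to check this locally is to time the `model_fn` from the entry point script. The sketch below is only an illustration: the file name `model.pkl`, the use of `pickle`, and the helper `time_model_load` are assumptions, not what `sklearn_sentiment.py` actually contains.

```python
import os
import pickle
import tempfile
import time


def model_fn(model_dir):
    """Load the model from model_dir, as a SageMaker entry point script would.

    Assumption: the artifact is a pickle file named model.pkl.
    """
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)


def time_model_load(model_dir, loader=model_fn):
    """Hypothetical helper: return (model, seconds) for one call to loader(model_dir)."""
    start = time.perf_counter()
    model = loader(model_dir)
    return model, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in for the real artifact: a small dict pickled into a temp directory.
    # To measure the real model, point model_dir at the extracted model.tar.gz instead.
    with tempfile.TemporaryDirectory() as model_dir:
        with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
            pickle.dump({"coef": [0.1, 0.2]}, f)
        model, seconds = time_model_load(model_dir)
        print("model loaded in %.3f s" % seconds)
```

If the real model takes a long time to load this way, that would be consistent with the ping timeout, although the timing on the endpoint instance may differ from a local run.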