Failing to deploy with an 800 MB sklearn model  #605

@sylvainrobbiano

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): sklearn
  • Framework Version:
  • Python Version: 3.5
  • CPU or GPU: CPU
  • Python SDK Version:
  • Are you using a custom image: No

Describe the problem

I am trying to deploy a logistic regression model with the SageMaker sklearn estimator. When I train on 1/10 of the data, I can deploy without problems using the commands below. When I train on all the data, training succeeds and the resulting model is around 800 MB, but the deployment fails with these errors:

Minimal repro / logs

"In the Jupyter notebook"
ValueError: Error hosting endpoint sagemaker-scikit-learn-2019-01-17-12-59-16-371: Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

"In the CloudWatch console"
2019/01/17 14:29:00 [error] 25#25: *47 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.32.0.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock/ping", host: "model.aws.local:8080"

  • Exact command to reproduce:

    from sagemaker.sklearn.estimator import SKLearn

    script_path = 'sklearn_sentiment.py'

    sklearn_preprocessor = SKLearn(
        entry_point=script_path,
        role=role,
        train_instance_type="ml.m4.4xlarge",
        sagemaker_session=sagemaker_session)
    sklearn_preprocessor.fit({'train': data_location})
    predictor = sklearn_preprocessor.deploy(initial_instance_count=1, instance_type="ml.c5.4xlarge")
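
The CloudWatch line above shows only that GET /ping timed out, not why. One possibility is that loading the 800 MB model takes longer than the health check allows. This is an assumption that the logs do not confirm. A minimal sketch for measuring load time outside SageMaker could look like the following. The helper `time_model_load` and the file name `model.joblib` are hypothetical, and the default loader assumes the model was saved with pickle.

```python
import os
import pickle
import time


def time_model_load(model_path, loader=None):
    """Return (size_mb, seconds) for deserializing one model file.

    Hypothetical helper for local diagnosis only; it is not part of the
    SageMaker SDK.
    """
    if loader is None:
        # Assumption: the model was saved with pickle. Pass a different
        # loader (for example joblib.load) if it was saved another way.
        def loader(path):
            with open(path, "rb") as f:
                return pickle.load(f)

    size_mb = os.path.getsize(model_path) / (1024 * 1024)
    start = time.perf_counter()
    loader(model_path)  # the loaded object is discarded; only timing matters
    seconds = time.perf_counter() - start
    return size_mb, seconds
```

If the training script saved the model with joblib, the call would be `time_model_load("model.joblib", loader=joblib.load)` after `import joblib`. Comparing the result for the 1/10 model and the full model would show whether load time grows enough to be a plausible cause.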
