In this post, we walk through building a robust workflow for predicting a baby’s weight from structured data using BigQuery and Google Cloud AI Platform. From data preparation to deploying a machine learning (ML) model, we’ll cover the process step by step.
Step 1: Set Up Google Cloud Environment
The first step is preparing your cloud environment by setting up:
BigQuery Dataset: A container for storing and analyzing large datasets.
Google Cloud Storage (GCS) Bucket: A storage location for model training artifacts and datasets.
Key Commands:
Use the bq command-line tool to create the BigQuery dataset.
Use gsutil to create and manage GCS buckets.
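If you would rather script this setup than run the CLI commands by hand, the BigQuery and Cloud Storage Python client libraries can do the same thing. A minimal sketch, assuming placeholder project, bucket, and region values and a dataset named babyweight:
from google.cloud import bigquery, storage

PROJECT = "your-project-id"   # assumption: replace with your GCP project ID
BUCKET = "your-bucket-name"   # assumption: bucket names must be globally unique
REGION = "us-central1"        # assumption: any supported region works

# Create the BigQuery dataset that will hold the preprocessed natality data.
bq_client = bigquery.Client(project=PROJECT)
bq_client.create_dataset(bigquery.Dataset(f"{PROJECT}.babyweight"), exists_ok=True)

# Create the GCS bucket for the exported CSVs and training artifacts.
storage.Client(project=PROJECT).create_bucket(BUCKET, location=REGION)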
Step 2: Preprocessing the Dataset
We used the CDC’s publicly available natality dataset in BigQuery. After filtering and preprocessing, we extracted relevant columns such as weight_pounds, is_male, mother_age, plurality, and gestation_weeks.
Preprocessing Steps:
Convert is_male to a string for better compatibility.
Reformat plurality values into human-readable categories like "Twins(2)" or "Triplets(3)."
Exclude invalid entries (e.g., weights <= 0 or ages <= 0).
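A query sketch covering these transformations, run through the BigQuery Python client (the public natality table and the plurality labels beyond Twins/Triplets are assumptions):
from google.cloud import bigquery

client = bigquery.Client()

# Cast is_male to a string, map plurality codes to readable labels,
# and drop rows with invalid weights, ages, or gestation weeks.
CLEAN_QUERY = """
SELECT
  weight_pounds,
  CAST(is_male AS STRING) AS is_male,
  mother_age,
  CASE
    WHEN plurality = 1 THEN "Single(1)"
    WHEN plurality = 2 THEN "Twins(2)"
    WHEN plurality = 3 THEN "Triplets(3)"
    WHEN plurality = 4 THEN "Quadruplets(4)"
    WHEN plurality = 5 THEN "Quintuplets(5)"
  END AS plurality,
  gestation_weeks
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds > 0
  AND mother_age > 0
  AND plurality > 0
  AND gestation_weeks > 0
"""
clean_df = client.query(CLEAN_QUERY).to_dataframe()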
Simulated Missing Data:
We augmented the dataset by introducing rows where:
Gender (is_male) is set to "Unknown."
Non-single births were grouped as "Multiple(2+)."
This step prepares the model to handle missing data scenarios.
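The simulated rows can come from a second SELECT over the same source that hard-codes the unknown values; the result is then appended to the cleaned rows (for example with UNION ALL). A sketch along those lines:
from google.cloud import bigquery

client = bigquery.Client()

# Same filtering as the cleaning query, but with gender hidden and all
# non-single births collapsed into one bucket.
AUGMENT_QUERY = """
SELECT
  weight_pounds,
  "Unknown" AS is_male,
  mother_age,
  CASE
    WHEN plurality = 1 THEN "Single(1)"
    ELSE "Multiple(2+)"
  END AS plurality,
  gestation_weeks
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds > 0
  AND mother_age > 0
  AND plurality > 0
  AND gestation_weeks > 0
"""
augmented_df = client.query(AUGMENT_QUERY).to_dataframe()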
Step 3: Train-Test Data Split
The dataset was split into training (75%) and evaluation (25%) subsets using a hash-based method, so the split is random yet reproducible across runs.
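One common way to do such a hash-based split is BigQuery's FARM_FINGERPRINT: each row lands in one of four buckets, three of which become training data. A sketch, where the hashed expression (year and month) and the preprocessed table name are assumptions:
from google.cloud import bigquery

client = bigquery.Client()

# Rows hash into buckets 0-3; buckets 0-2 (~75%) train, bucket 3 (~25%) evaluates.
SPLIT_EXPR = ("ABS(MOD(FARM_FINGERPRINT("
              "CONCAT(CAST(year AS STRING), CAST(month AS STRING))), 4))")

train_query = f"""
SELECT * FROM `{client.project}.babyweight.preprocessed`
WHERE {SPLIT_EXPR} < 3
"""
eval_query = f"""
SELECT * FROM `{client.project}.babyweight.preprocessed`
WHERE {SPLIT_EXPR} = 3
"""
# Run each query and write its result to a separate table (omitted here).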
Step 4: Export Data to GCS
The BigQuery Python API was used to export the training and evaluation subsets as CSV files to GCS. These CSVs serve as the input data for the TensorFlow/Keras-based model training.
Commands:
gsutil ls gs://${BUCKET}/babyweight/data/*.csv
This command verifies the creation of CSV files.
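The export itself uses the BigQuery Python client's extract_table call, which writes a table out as sharded CSVs. A minimal sketch, where the bucket and the train_data / eval_data table names are placeholders:
from google.cloud import bigquery

client = bigquery.Client()
BUCKET = "your-bucket-name"  # assumption: the bucket created in Step 1

for table_name, prefix in [("train_data", "train"), ("eval_data", "eval")]:
    table_ref = bigquery.DatasetReference(client.project, "babyweight").table(table_name)
    destination_uri = f"gs://{BUCKET}/babyweight/data/{prefix}*.csv"
    extract_job = client.extract_table(table_ref, destination_uri, location="US")
    extract_job.result()  # block until the export job finishes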
Step 5: Train the Model
Training was conducted on Google Cloud AI Platform using the TensorFlow framework.
Parameters:
Epochs: 10
Batch Size: 32
Machine Type: n1-standard-8
Command:
gcloud ai-platform jobs submit training ${JOBID} \
--region=${REGION} \
--module-name=trainer.task \
--package-path=$(pwd)/babyweight/trainer \
--job-dir=gs://${BUCKET}/babyweight/trained_model \
--runtime-version=2.6 \
--python-version=3.7 \
--scale-tier=custom \
--master-machine-type=n1-standard-8 \
-- \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv
The flags before the bare -- configure AI Platform itself (including a staging location via --job-dir and the n1-standard-8 machine), while everything after it is passed through to the trainer.task module.
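The trainer.task module itself is not shown in this post. Purely as an illustration of the kind of pipeline it wraps, here is a minimal sketch that reads the exported CSVs with tf.data, trains a small Keras regressor on the numeric columns for the 10 epochs and batch size of 32 listed above, and writes a SavedModel; the real trainer would also encode is_male and plurality:
import tensorflow as tf

LABEL_COLUMN = "weight_pounds"
NUMERIC_FEATURES = ["mother_age", "gestation_weeks"]

def load_dataset(pattern, batch_size=32, training=True):
    """Stream sharded, headered CSVs from GCS into (features, label) batches."""
    return tf.data.experimental.make_csv_dataset(
        pattern, batch_size,
        select_columns=[LABEL_COLUMN] + NUMERIC_FEATURES,
        column_defaults=[0.0, 0.0, 0.0],
        label_name=LABEL_COLUMN,
        num_epochs=1,
        shuffle=training)

def to_feature_vector(features, label):
    """Stack the numeric columns into a single (batch, 2) input tensor."""
    return tf.stack([features[name] for name in NUMERIC_FEATURES], axis=-1), label

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(len(NUMERIC_FEATURES),)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model

if __name__ == "__main__":
    train_ds = load_dataset("gs://your-bucket-name/babyweight/data/train*.csv").map(to_feature_vector)
    eval_ds = load_dataset("gs://your-bucket-name/babyweight/data/eval*.csv",
                           training=False).map(to_feature_vector)
    model = build_model()
    model.fit(train_ds, validation_data=eval_ds, epochs=10)       # epochs from the parameters above
    model.save("gs://your-bucket-name/babyweight/trained_model")  # SavedModel consumed in Step 6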
Step 6: Deploy the Model
The trained model, saved as saved_model.pb in the GCS bucket, was deployed to Google Cloud AI Platform. This deployment creates a REST API endpoint for making predictions.
Deployment Command:
gcloud ai-platform models create ${MODEL_NAME} --regions ${REGION}
gcloud ai-platform versions create ${MODEL_VERSION} \
--model=${MODEL_NAME} \
--origin=${MODEL_LOCATION} \
--runtime-version=2.6 \
--python-version=3.7
Conclusion
With the model deployed, you can now send requests to the REST endpoint to predict baby weights. This workflow demonstrates the power of Google Cloud tools in managing every stage of the ML lifecycle—from preprocessing and training to deployment.
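As an example of what such a request can look like from Python, here is a sketch using the google-api-python-client library against the version deployed in Step 6; the project, model, and version names are placeholders, and the instance keys must match your model's serving signature:
from googleapiclient import discovery

PROJECT = "your-project-id"  # assumption: the project hosting the model
MODEL_NAME = "babyweight"    # assumption: the ${MODEL_NAME} from Step 6
MODEL_VERSION = "v1"         # assumption: the ${MODEL_VERSION} from Step 6

# Build a client for the AI Platform Prediction REST API and call the endpoint.
service = discovery.build("ml", "v1")
name = f"projects/{PROJECT}/models/{MODEL_NAME}/versions/{MODEL_VERSION}"

instances = [{
    "is_male": "True",
    "mother_age": 26.0,
    "plurality": "Single(1)",
    "gestation_weeks": 39,
}]

response = service.projects().predict(name=name, body={"instances": instances}).execute()
print(response["predictions"])  # each prediction contains the estimated weight in pounds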