Structured data prediction using Vertex AI Platform

In this post, we delve into how to create a robust workflow for predicting a baby’s weight using structured data and Google Cloud's Vertex AI Platform. From data preparation to deploying a machine learning (ML) model, we’ll explore the process step by step.


Step 1: Set Up Google Cloud Environment

The first step is preparing your cloud environment by setting up:

  1. BigQuery Dataset: A container for storing and analyzing large datasets.

  2. Google Cloud Storage (GCS) Bucket: A storage location for model training artifacts and datasets.

Key Commands:

  • Use the bq command-line tool to create the BigQuery dataset.

  • Employ gsutil to create and manage GCS buckets.
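
If you prefer to stay in Python, the same setup can be scripted with the google-cloud-bigquery and google-cloud-storage client libraries. A minimal sketch, assuming placeholder project, region, and bucket names that you would replace with your own:

from google.cloud import bigquery, storage

PROJECT = "your-project-id"       # placeholder
REGION = "us-central1"            # placeholder: any supported region
BUCKET = f"{PROJECT}-babyweight"  # placeholder

# Create the BigQuery dataset (equivalent to `bq mk babyweight`).
bigquery.Client(project=PROJECT).create_dataset(f"{PROJECT}.babyweight", exists_ok=True)

# Create the GCS bucket (equivalent to `gsutil mb -l ${REGION} gs://${BUCKET}`).
storage.Client(project=PROJECT).create_bucket(BUCKET, location=REGION)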


Step 2: Preprocessing the Dataset

We used the CDC’s publicly available natality dataset in BigQuery. After filtering and preprocessing, we extracted relevant columns such as weight_pounds, is_male, mother_age, plurality, and gestation_weeks.

Preprocessing Steps:

  • Convert is_male to a string so that the "Unknown" category introduced below can be represented alongside true and false.

  • Reformat plurality values into human-readable categories like "Twins(2)" or "Triplets(3)."

  • Exclude invalid entries (e.g., weights <= 0 or ages <= 0).

Simulated Missing Data:

We augmented the dataset by introducing rows where:

  • Gender (is_male) is set to "Unknown."

  • Plurality for non-single births is grouped as "Multiple(2+)."

This step prepares the model to handle missing data scenarios.
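
In BigQuery, the filtering, reformatting, and simulated-missing-data rows can all be expressed in one query and materialized into a table for the later steps. The sketch below, run through the BigQuery Python client, shows the shape of that query; the exact filters, category names, and table names are illustrative rather than the verbatim notebook query:

from google.cloud import bigquery

PROJECT = "your-project-id"  # placeholder

query = """
SELECT
  weight_pounds,
  CAST(is_male AS STRING) AS is_male,
  mother_age,
  CASE
    WHEN plurality = 1 THEN 'Single(1)'
    WHEN plurality = 2 THEN 'Twins(2)'
    WHEN plurality = 3 THEN 'Triplets(3)'
    WHEN plurality = 4 THEN 'Quadruplets(4)'
    WHEN plurality = 5 THEN 'Quintuplets(5)'
  END AS plurality,
  gestation_weeks
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds > 0 AND mother_age > 0 AND plurality > 0 AND gestation_weeks > 0

UNION ALL

-- Simulated missing data: unknown gender, non-single births collapsed together.
SELECT
  weight_pounds,
  'Unknown' AS is_male,
  mother_age,
  CASE WHEN plurality = 1 THEN 'Single(1)' ELSE 'Multiple(2+)' END AS plurality,
  gestation_weeks
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds > 0 AND mother_age > 0 AND plurality > 0 AND gestation_weeks > 0
"""

client = bigquery.Client(project=PROJECT)
job_config = bigquery.QueryJobConfig(
    destination=f"{PROJECT}.babyweight.babyweight_data",  # illustrative table name
    write_disposition="WRITE_TRUNCATE",
)
client.query(query, job_config=job_config).result()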


Step 3: Train-Test Data Split

The dataset was split into training (75%) and evaluation (25%) subsets by hashing a stable key for each row, which makes the split reproducible while still behaving like random sampling.
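
The key property of a hash-based split is determinism: a given row always lands in the same subset, yet across the dataset the assignment behaves like random sampling. The real workflow can compute such a hash directly in BigQuery (for example with FARM_FINGERPRINT); the small Python sketch below just illustrates the idea:

import hashlib

def split_bucket(key: str, train_fraction: float = 0.75) -> str:
    # Hash a stable key for the row and map it into the range [0, 100);
    # values below the training fraction go to "train", the rest to "eval".
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return "train" if digest % 100 < train_fraction * 100 else "eval"

# The same key always produces the same assignment, so the split is reproducible.
print(split_bucket("2007|39|Single(1)"))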


Step 4: Export Data to GCS

The BigQuery Python API was used to export the train and evaluation subsets as CSV files to GCS. These CSVs serve as input data for the TensorFlow/Keras model training.
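
A sketch of that export with the BigQuery Python client, assuming the two splits were materialized as separate tables in the babyweight dataset (the table names here are illustrative):

from google.cloud import bigquery

PROJECT = "your-project-id"  # placeholder
BUCKET = "your-bucket"       # placeholder

client = bigquery.Client(project=PROJECT)
for split in ("train", "eval"):
    job = client.extract_table(
        f"{PROJECT}.babyweight.babyweight_data_{split}",            # illustrative table name
        f"gs://{BUCKET}/babyweight/data/{split}*.csv",              # wildcard lets BigQuery shard the output
        job_config=bigquery.ExtractJobConfig(print_header=False),   # adjust to match how the trainer reads CSVs
    )
    job.result()  # block until the export finishes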

Commands:

gsutil ls gs://${BUCKET}/babyweight/data/*.csv

This command verifies the creation of CSV files.


Step 5: Train the Model

Training was conducted on Google Cloud AI Platform using the TensorFlow framework.

Parameters:

  • Epochs: 10

  • Batch Size: 32

  • Machine Type: n1-standard-8

Command:

gcloud ai-platform jobs submit training ${JOBID} \
--region=${REGION} \
--module-name=trainer.task \
--package-path=$(pwd)/babyweight/trainer \
--job-dir=gs://${BUCKET}/babyweight/trained_model \
--scale-tier=CUSTOM \
--master-machine-type=n1-standard-8 \
--runtime-version=2.6 \
--python-version=3.7 \
-- \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv

The flags before the bare -- configure the AI Platform job itself; everything after it is passed through to trainer.task as user arguments (here, the training and evaluation data paths).
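
The model itself lives in the trainer.task module referenced above; its exact architecture is not reproduced in this post. As a rough idea of what such a trainer builds, here is a small Keras regression model over the feature columns exported earlier (the layer sizes and category vocabularies are purely illustrative assumptions):

import tensorflow as tf

# Feature inputs matching the exported CSV columns (everything except the label).
inputs = {
    "is_male": tf.keras.Input(shape=(1,), dtype=tf.string, name="is_male"),
    "plurality": tf.keras.Input(shape=(1,), dtype=tf.string, name="plurality"),
    "mother_age": tf.keras.Input(shape=(1,), dtype=tf.float32, name="mother_age"),
    "gestation_weeks": tf.keras.Input(shape=(1,), dtype=tf.float32, name="gestation_weeks"),
}

# One-hot encode the categorical features; the vocabularies are assumptions.
is_male = tf.keras.layers.StringLookup(
    vocabulary=["true", "false", "Unknown"], output_mode="one_hot")(inputs["is_male"])
plurality = tf.keras.layers.StringLookup(
    vocabulary=["Single(1)", "Twins(2)", "Triplets(3)", "Multiple(2+)"],
    output_mode="one_hot")(inputs["plurality"])

x = tf.keras.layers.Concatenate()(
    [is_male, plurality, inputs["mother_age"], inputs["gestation_weeks"]])
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dense(32, activation="relu")(x)
output = tf.keras.layers.Dense(1, name="weight_pounds")(x)  # predicted weight in pounds

model = tf.keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

The epoch count and batch size listed in the parameters above would typically be passed through to the trainer's model.fit call on the CSV data.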

Step 6: Deploy the Model

The trained model, exported as a TensorFlow SavedModel (saved_model.pb) in the GCS bucket, was deployed to Google Cloud AI Platform; the SavedModel's GCS directory is what ${MODEL_LOCATION} refers to in the commands below. This deployment creates a REST API endpoint for making predictions.

Deployment Command:

gcloud ai-platform models create ${MODEL_NAME} --regions ${REGION}
gcloud ai-platform versions create ${MODEL_VERSION} \
--model=${MODEL_NAME} \
--origin=${MODEL_LOCATION} \
--runtime-version=2.6 \
--python-version=3.7
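
Once the version is live, prediction requests can be sent from any HTTP client. A minimal sketch using the Google API Python client, where the project, model, and version names are whatever you chose above and the instance keys mirror the training features:

from googleapiclient import discovery

PROJECT = "your-project-id"  # placeholder
MODEL_NAME = "babyweight"    # placeholder: the model created above
MODEL_VERSION = "v1"         # placeholder: the version created above

service = discovery.build("ml", "v1")
name = f"projects/{PROJECT}/models/{MODEL_NAME}/versions/{MODEL_VERSION}"

instances = [{
    "is_male": "true",
    "mother_age": 26.0,
    "plurality": "Single(1)",
    "gestation_weeks": 39.0,
}]

# Send an online prediction request to the deployed version's REST endpoint.
response = service.projects().predict(name=name, body={"instances": instances}).execute()
print(response["predictions"])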

Conclusion

With the model deployed, you can now send requests to the REST endpoint to predict baby weights. This workflow demonstrates the power of Google Cloud tools in managing every stage of the ML lifecycle—from preprocessing and training to deployment.