I want to start a Vertex AI AutoML Text Entity Extraction Batch Prediction Job, but based on my own experience, the texts (the "content" field in the JSONL structure) must also satisfy the following two requirements (a sample JSONL line is shown right after the list):
- Every text must be between 10 and 10,000 bytes in size: DONE
- Every text must be UTF-8 encoded: UNKNOWN
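
For reference, assuming inline text is accepted (which is what the query below produces), each exported JSONL line would end up looking roughly like the following; the posting text here is made up:

{"content": "Senior data engineer posting: SQL, Python and GCP experience required", "mimeType": "text"}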
My original data is stored in BigQuery, so I will have to export it to Google Cloud Storage for the later batch prediction. To take advantage of BigQuery's optimizations, I want to handle those two requirements in the BigQuery source table itself. I have checked Google's official documentation, and the closest related information I could find is this; however, it is not quite what I am after. BTW, the query looks as follows:
WITH mydata AS (
  SELECT
    CASE
      WHEN BYTE_LENGTH(posting) > 10000 THEN LEFT(posting, 9950)
      WHEN BYTE_LENGTH(posting) < 10 THEN CONCAT(posting, " is possibly a skill")
      ELSE posting
      END AS posting
  FROM `my-project.Machine_Learning_Datasets.sample-data-source`  -- Modified for data protection
)
SELECT
  posting AS content,  -- Something needs to be done here
  "text" AS mimeType
FROM mydata
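
The plan is to eventually wrap that query in an EXPORT DATA statement so the result lands in Cloud Storage as newline-delimited JSON; a minimal sketch of what I have in mind, with a placeholder bucket and path:

EXPORT DATA OPTIONS(
  uri = 'gs://my-bucket/batch-prediction-input/*.jsonl',  -- placeholder destination; the single wildcard is required
  format = 'JSON',   -- newline-delimited JSON, i.e. JSONL
  overwrite = TRUE
) AS
WITH mydata AS (
  SELECT
    CASE
      WHEN BYTE_LENGTH(posting) > 10000 THEN LEFT(posting, 9950)
      WHEN BYTE_LENGTH(posting) < 10 THEN CONCAT(posting, " is possibly a skill")
      ELSE posting
      END AS posting
  FROM `my-project.Machine_Learning_Datasets.sample-data-source`
)
SELECT
  posting AS content,
  "text" AS mimeType
FROM mydata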
And the schema of `my-project.Machine_Learning_Datasets.sample-data-source` looks as follows:
| Field name | Type | Mode | Records | 
|---|---|---|---|
| posting | STRING | NULLABLE | 100M | 
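
For completeness, the byte-size side can be sanity-checked with something along these lines (the NULL count matters because the column is NULLABLE):

SELECT
  COUNTIF(posting IS NULL) AS null_postings,
  COUNTIF(BYTE_LENGTH(posting) < 10) AS too_short,
  COUNTIF(BYTE_LENGTH(posting) > 10000) AS too_long
FROM `my-project.Machine_Learning_Datasets.sample-data-source`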
Any ideas?
