Add data

When you add a dataset to Opaque, it’s automatically encrypted. A test dataset is also generated to support secure job development. Follow these guidelines to prepare for the upload process:

Supported data types

Data Type   Accepted Formats                       Available Operators
boolean     1, 0, true, false                      =, !=, is empty, is not empty
string      Any combination of characters          Same as boolean
integer     Int32 (no commas)                      =, !=, <, >, <=, >=
long        Int64 (no commas)                      Same as integer
float       Single-precision (no commas)           Same as integer
double      Double-precision (no commas)           Same as integer
date ISO    YYYY-MM-DD                             Same as integer
date US     MM/DD/YYYY                             Same as integer
date EU     DD.MM.YYYY                             Same as integer
timestamp   YYYY-MM-DD hh:mm:ss.ms (ms optional)   Same as integer
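
For reference, here is a purely illustrative comma-separated row with one value per type, in the order listed above:

true,Jane Doe,42,9876543210,3.14,2.718281828459,2021-06-30,06/30/2021,30.06.2021,2021-06-30 14:05:09.250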

Add a dataset

Adding a dataset to your organization is a multi-step process, as shown in the following figure.

The Datasets provisioning screen

Follow these steps to get started:

  1. Start the upload:
    • On the primary navigation bar, select Data Management and click + Dataset to open the upload window.
    • You can exit the process anytime by clicking Close or Cancel. Resuming will continue where you left off.
  2. Upload a schema file:
    • Click Browse to select your schema file (TXT or JSON) or drag it into the upload box.
    • Click Next.
  3. Set schema rules:
    • Select which columns of your dataset can be used in queries. Columns are listed as defined in the schema file.
    • (Optional) Select Share all columns with other workspace members to make all columns accessible.
    • Click Next.
  4. Upload the data file:

    • Add a unique dataset name and a description.
    • Choose a data source; your options include:

      • Upload local file: With this option, you can upload a file up to 700 GB in size.
      • Choose a cloud source: Your options include Amazon S3, Azure Blob Storage, Azure File Storage, and Google Cloud Storage. You’ll need the following details to complete this step:
        • Amazon S3:

          • The name of the S3 bucket where the data file is stored
          • The name of the data file (that is, the object key)
          • Your AWS access key ID
          • Your AWS secret access key

          See Managing access keys for IAM users for information about finding your AWS access keys. (For an example of these details with placeholder values, see the illustration after these steps.)

        • Azure Blob Storage:

          • The name of the container where the data file is stored
          • The name of the blob (that is, the data file)
          • The connection string to your storage account

        • Azure File Storage:

          • The name of the Azure file share where the data file is stored
          • The path to the file within the share
          • The connection string to your storage account

        • Google Cloud Storage:

          • The name of the Cloud Storage bucket where the data file is stored
          • The name of the object (that is, the data file)
          • The contents of your service account key file in JSON format

    • Click Next to configure data rules.

  5. (Optional) Set data rules:

    • Use row rules to specify conditions for data rows accessible in jobs. (See Apply a data policy for details.)
    • Condition Selector: Select a column, operator, and value. Add multiple conditions if needed.
    • Script Editor: Write SQL conditions directly.
  6. Enable test data generation (notebook jobs only):

    • Choose one of the following test data types:
      • Maintain dataset distributions and format: Generates realistic test data with similar statistical properties to the original dataset.
      • Random dummy data: Creates data based on column names and types for added privacy, but with lower utility for queries.
    • See also Test data in Opaque.
  7. Click Upload to begin transferring and encrypting your data.

    • A progress bar will track the upload and encryption process.
    • You can cancel transfers from local files before they reach 50% completion. Transfers from cloud sources cannot be canceled once started.

After a successful upload, your dataset will appear in the Data Management table.
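
As an illustration of the cloud source details gathered in step 4, the Amazon S3 fields might be filled in as follows. All values are hypothetical placeholders, not working credentials; together, the bucket name and object key identify the object s3://clinical-trials-data/exports/patients-2024.csv:

S3 bucket name: clinical-trials-data
Object key (data file name): exports/patients-2024.csv
AWS access key ID: AKIAIOSFODNN7EXAMPLE
AWS secret access key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY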

Apply schema and policy rules

Every connected dataset is associated with a schema and a policy. This section explains how to apply both when uploading a dataset.

Create a schema file

A schema file tells Opaque how to interpret your data. It lists each column and its data type in the same order as the columns in your data file, with entries separated by commas and no spaces.

Example of a schema file:

Name:string,Age:integer,Glucose:integer,BMI:float
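
To make the correspondence concrete, a comma-separated data file described by this schema would list its columns in the same order; a few rows might look like this (made-up values):

Alice,34,112,23.4
Bob,57,145,27.9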

Apply a data policy

Policies specify which columns and rows can be accessed in jobs:

  • Column rules define the columns in a dataset that can be referenced in queries.
    • Select accessible columns during step 3, Set schema rules, above.
  • Row rules define which records are visible in jobs—for example, to limit access based on age or region.
    • Define row access using conditions or SQL scripts (see step 5, Set data rules, above).
    • For example, based on the previous schema example, if you don’t want to share information about anyone in your dataset aged 50 or older, you have two options:
      • Condition selector: Select Age from the Column list, select < from the Operator list, and enter 50 in the Value field.
      • Script editor: Click Use a SQL Condition and enter age < 50 in the editor.
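
Conditions can also be combined. Assuming the Script Editor accepts standard SQL boolean operators, a rule that shares only rows for people under 50 with a BMI below 30 might be written as:

age < 50 AND bmi < 30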

Test data in Opaque

To support job development without exposing sensitive information, Opaque automatically generates test datasets for use during query authoring and testing. These datasets let job creators develop logic and validate output without needing approval to access real data, making it easier to design and explore queries safely.

Test data can be created:

  • During the dataset upload process or
  • After job results are produced (if no test data was generated initially).

Opaque offers two types of test data:

  • Dataset-similar test data, which mimics the format and statistical properties of the original dataset while anonymizing personally identifiable information (PII). This is ideal for realistic query development with strong privacy guarantees.
  • Random dummy data, which contains only randomly generated values based on column names and data types. This offers stronger privacy protections, but may be less useful for logic testing or result validation.

Both options provide a safe, policy-compliant way to build and test jobs before touching real data.