Write a job query

The Query tab is where you write the logic for your notebook job using Opaque’s built-in editor. It supports both Python (PySpark) and PySpark SQL and provides a secure environment to write, test, and refine your job—whether you’re starting from scratch or modifying a reusable query.

You can load a starter script, reference encrypted datasets, define variables, test with synthetic data, and save your results—all without accessing raw data. The editor also supports pinned queries, which let you reuse logic across jobs in the same workspace.

This section walks through the full process of writing a secure, executable query—from loading data to building logic and saving results securely. For guidance on the editor capabilities that support this workflow, see Authoring tools.

Before you begin

Before writing your first query in Opaque, keep the following in mind:

  • Use the starter script: We strongly recommend starting with the built-in Python or PySpark SQL starter scripts. These scripts include essential boilerplate, such as loading encrypted datasets and saving your results. Without these components, your query will fail.
  • Replace placeholders: The starter script includes placeholder lines (like load_dataset_from_storage(...)). Be sure to replace them with actual dataset definitions using the copy icon in the Datasets panel.
  • Write in PySpark: Whether you use Python or SQL, your code must run on PySpark. For example:
    • Python queries must operate on PySpark DataFrames.
    • SQL queries are written using PySpark SQL, not vanilla SQL.
  • Define a save step: Every query must include a line that saves the results using the function write_dataset_to_storage(...).

Following these guidelines will help your job run successfully and avoid the most common sources of confusion.

Step 1. Use the starter scripts

To help you get started, Opaque provides default scripts for Python and SQL that include the required boilerplate and function calls. These scripts follow the structure expected by the editor—loading encrypted data, writing logic, and saving results—so you can focus on your query logic without having to set everything up manually.

To load a starter script:

  1. Click the code tag icon in the editor toolbar.
  2. Choose either a Python or SQL starter template.
  3. After inserting a starter script, be sure to update the dataset path and query logic to match your job. Leaving the boilerplate unchanged will cause your job to fail.

Note

Opaque runs all Python code using PySpark. This means you'll be working with Spark DataFrames—not pandas—and using Spark syntax. For example, use df.select() instead of df['column'].
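
For instance, here are a few common operations in Spark syntax (an illustrative sketch, assuming a DataFrame named df with Name and Age columns):

names = df.select("Name")           # select columns with .select()
adults = df.filter("Age > 30")      # filter rows with a condition string
by_age = df.groupBy("Age").count()  # aggregate with .groupBy()

The Python starter script below shows the full structure expected by the editor.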

#### 1. Load data ####

## Load the encrypted dataset(s) needed for this job.

# To define your dataset, click the copy icon next to it in the Datasets panel.
# This copies a complete definition, including the dataset name and remote path,
# for example: merchant = load_dataset_from_storage("https://...").
# Replace the placeholder line below with the copied definition.

df = load_dataset_from_storage("https://<storageaccount>.dfs.core.windows.net/<storagecontainer>/<dataname>")

#### 2. Perform computation ####

## Replace this section with your query logic.

# Example: filter the dataset to include only rows where Age is greater than 30.

result = df.filter("Age > 30")

#### 3. Output results ####

## Save your results as an encrypted dataset.
# Opaque will automatically append a timestamp to prevent overwriting past results.

# Rename the output filename.
result_data_name = "job_result"

# Replace 'result' with the name of the DataFrame you want to save.
write_dataset_to_storage(result, result_data_name)

Note

Opaque SQL is powered by PySpark SQL. You must create temporary views from DataFrames before running SQL queries.

#### 1. Load data ####

## Load the encrypted dataset(s) needed for this job.

# To define your dataset, click the copy icon next to it in the Datasets panel.
# This copies a complete definition, including the dataset name and remote path,
# for example: merchant = load_dataset_from_storage("https://...").
# Replace the placeholder line below with the copied definition.

df = load_dataset_from_storage("https://<storageaccount>.dfs.core.windows.net/<storagecontainer>/<dataname>")

#### 2. Perform computation (SQL) ####

## Replace this section with your SQL query logic.

# First, register your DataFrame as a temporary SQL table.
df.createOrReplaceTempView("df")

# Then, write your SQL query using that table.
# Example: Filter rows where Age is greater than 30.

sql = """
SELECT *
FROM df
WHERE Age > 30
"""

# Run the SQL query and store the result as a new DataFrame.
result = spark.sql(sql)

#### 3. Output results ####

## Save your results as an encrypted dataset.
# Opaque will automatically append a timestamp to prevent overwriting past results.

# Rename the output filename.
result_data_name = "job_result"

# Replace 'result' with the name of the DataFrame you want to save.
write_dataset_to_storage(result, result_data_name)

Required modules

Opaque automatically includes a system-generated preamble at the top of the query editor. This non-editable code block is required for all notebook jobs and defines the required modules and helper functions—such as load_dataset_from_storage() and write_dataset_to_storage()—used to load data, save results, and run queries securely.

You don’t need to modify or import anything manually. This block is always included at runtime and ensures your query runs correctly—even if you write your job from scratch.

Step 2. Load encrypted datasets

To reference data in your query, you must first load it securely into a DataFrame—a table-like data structure used in PySpark to store and manipulate data in memory. Each DataFrame represents a dataset you can query, filter, or transform.

  1. In the Datasets list (left of the editor), locate the dataset you want to analyze. You can:
    • Search for the dataset by name or by the name of one of its columns.
    • Expand a dataset to view its schema. Columns appear in the same order as in the actual dataset.
    • If the dataset is missing from the list, ensure it has been shared with your workspace.
  2. Click Copy definition to clipboard on the right of the dataset name.
  3. In the query editor, place your cursor where you want to insert the definition and paste it.

The dataset definition is inserted into your current query using the built-in load_dataset_from_storage() function. You can then perform query operations on the resulting DataFrame.

In the starter script, replace this placeholder line with your dataset definition:

# Replace the placeholder line below with the copied definition.

df = load_dataset_from_storage("https://<storageaccount>.dfs.core.windows.net/<storagecontainer>/<dataname>")
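
If your job uses more than one dataset, repeat this step for each one and give each DataFrame a descriptive name, for example (illustrative names only; copy the real definitions from the Datasets panel):

merchant = load_dataset_from_storage("https://<storageaccount>.dfs.core.windows.net/<storagecontainer>/merchant")
transactions = load_dataset_from_storage("https://<storageaccount>.dfs.core.windows.net/<storagecontainer>/transactions")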

Step 3. Write your logic

You can write your notebook job in Python (PySpark) or PySpark SQL, depending on your workflow and preference.

If you started from the starter script, this is where you insert or modify your logic—but don't remove the load/save steps.

Write your logic using PySpark directly on the data. For example, to filter data where age is greater than 30, you’d write the following:

result = df.filter("Age > 30")

PySpark supports a wide range of transformations and actions, including filtering, joining, grouping, aggregation, and machine-learning tasks using Spark MLlib.
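
For example, a join followed by an aggregation might look like this (a sketch only; it assumes two loaded DataFrames, merchant and transactions, that share a merchant_id column):

# Join the two datasets on a shared key, then count transactions per merchant.
joined = transactions.join(merchant, on="merchant_id", how="inner")
result = joined.groupBy("merchant_id").count()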

Note

The starter script includes the required boilerplate for loading and saving your results, but you'll need to update the dataset path. If you rename your result DataFrame, also adjust the save line.

For example, change `write_dataset_to_storage(result, result_data_name)` to `write_dataset_to_storage(final_output, result_data_name)`.

To write a SQL query, you must first register your DataFrame as a temporary view. This makes your DataFrame available for SQL analysis.

# First, register your DataFrame as a temporary SQL table.
df.createOrReplaceTempView("df")

You can define multiple views if using more than one dataset.
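
For instance, with two datasets you would register a view for each (a sketch; the merchant DataFrame is illustrative):

df.createOrReplaceTempView("df")
merchant.createOrReplaceTempView("merchant")

Your SQL query can then reference both views by name.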

Next, replace the contents of the sql variable with your query.

# Example: Filter rows where Age is greater than 30.

sql = """
SELECT *
FROM df
WHERE Age > 30
"""

Finally, use Spark SQL to run your SQL query.

# Run the SQL query and store the result as a new DataFrame.
result = spark.sql(sql)

Step 4. Save results

Before your job can complete, you must save your final results as an encrypted dataset using Opaque’s custom function.

Use write_dataset_to_storage(result, result_data_name), where:

  • result is the PySpark DataFrame you want to save.
  • result_data_name is a string that will be used as the output file name.

    Note

    The second parameter, result_data_name, must not contain any spaces.

Opaque automatically appends a timestamp to the saved dataset name so that each run generates a unique result and avoids overwriting previous output.

# Rename the output filename.
result_data_name = "job_result"

# Replace "result" with the name of the DataFrame you want to save.
write_dataset_to_storage(result, result_data_name)

Authoring tools

To streamline query development, Opaque provides tools that help you explore, reuse, and adapt jobs to different inputs—without exposing sensitive data. These features support secure, flexible workflows across jobs and use cases.

Query editor

The built-in query editor helps you write, understand, and debug your job securely and efficiently. Key features include:

  • Syntax highlighting: Color-codes code elements (keywords, operators, strings, etc.) and visually distinguishes:
    • Dataset names and column names (with hover to see details).
    • Input variables (with hover to reveal data type).
  • Autocompletion: Suggests relevant terms as you type, including:
    • Keywords, dataset and column names, input variables, and common functions.
    • Use arrow keys to navigate and Enter/Tab to insert. Press Esc to close suggestions.
  • Find and replace: Press Command+F (Mac) or Ctrl+F (Windows) to search within the query. Refine your search with:

    • RegExp Search: Match patterns using regular expressions.
    • CaseSensitive Search: Match exact casing.
    • Whole Word Search: Match full words only.
    • Search in Selection: Limit to selected text.

    To replace text:

    • Click Toggle Replace mode.
    • Enter the replacement.
    • Replace one match at a time or all at once.

  • Code annotations: When standard boilerplate code appears, the editor displays a tooltip icon next to the line. Hover to view helpful tips and context.

Pinned queries

You can pin a query to reuse it in future jobs. Pinned queries store the full contents of your current query editor and appear in the Pinned Queries list on the Query tab.

Pin a query

  1. On the Query tab, click Pin Query.
  2. Enter a name for the query (max 50 characters).
  3. Click Save.

The query is added to the end of your Pinned Queries list with an encircled check mark next to the name.

Use a pinned query

  1. On the Query tab, select Pinned Queries.
  2. Locate the query you want to reuse. You can:
    • Search by query name (search is case-sensitive).
    • Expand the query to preview its contents.
  3. Click Copy to clipboard on the right of the query name.
  4. In the query editor, place your cursor and paste.

The pinned query is added to your current editor content.

Delete a pinned query

  1. On the Query tab, select Pinned Queries.
  2. Locate the query you want to delete. You can:
    • Search by query name (search is case-sensitive).
    • Expand the query to preview its contents.
  3. Click Delete next to the query name.
  4. Click Confirm to permanently delete the pinned query.

The selected query is removed from your list.

Input variables

You can define input variables to be used in a query. Input variables must be assigned values before a job is run, making it easy to reuse queries with different inputs.

Create an input variable

  1. On the Query tab, select + Variable (near the Pinned Queries tab).
  2. Enter a name for the variable (max 50 characters).
  3. Select a variable type:
    • string
    • integer
    • float
    • boolean
  4. (Optional) Add a description for the variable.
  5. Choose whether the variable should be required before each job run.
  6. If not marked as required, enter a default value.
  7. Click Save.

The variable is added to the input variables list and appears in the non-editable section of the query editor. It can now be used throughout your query.

Edit an input variable

  1. On the Query tab, select Input Variables.
  2. Locate the variable you would like to edit.
  3. Click Edit next to the variable name.
  4. Modify any fields you would like.
  5. Click Save.

Delete an input variable

  1. On the Query tab, select Input Variables.
  2. Locate the variable you want to delete.
  3. Click Delete next to the variable.
  4. Click Confirm to delete the variable.

Use an input variable in a query

  1. On the Query tab, select Input Variables to view the list.
    • Click a variable to expand and show its details.
    • Hover over a variable to highlight its occurrences in the query editor.
  2. Use the variable’s name directly in your query. It will appear underlined and color-coded in the editor.

    Using variables within strings

    To use an input variable inside a string, use a Python formatted string literal (f-string): surround the variable name in {} and prefix the string with f. For example, the variable named minimum_age would be used like so:

    f"Age > {minimum_age}"

Assign input variable values before running a job

Before running an approved job, you must assign values to all input variables.

  1. On the job’s Query tab, select Input Variables to view the list of variables.
    • Click a variable to expand and show its details.
    • Hover over a variable to highlight its usage in the query.
  2. Variables marked with an asterisk * are required and must be set before the job can be run.
  3. Optional variables are pre-filled with default values, but you can override them before running the job.
  4. After confirming all values, click Run Job.

    Previous values

    Input values from past runs are saved. When you click a variable’s input box, you can select from previously used values via a dropdown.

Next steps

Once you’ve written and saved your query, the next step is to test the job using synthetic data. This enables you to validate your logic and verify the output—without exposing any sensitive information. For details, continue to Using test data.