Install GX
GX 1.0 is a Python library. Follow the instructions in this guide to install GX in your local Python environment, or as a notebook-scoped library in hosted environments such as Databricks or EMR Spark clusters.
Prerequisites
- Python version 3.8 to 3.11
- Recommended. A Python virtual environment
- Internet access
- Permissions to download and install packages in your environment
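The first two prerequisites can be checked from Python before you install anything. The following is a minimal sketch, not part of GX: the helper names `python_is_supported` and `in_virtual_environment` are hypothetical, and the version bounds are copied from the list above.

```python
import sys

# Supported range taken from the prerequisites above (3.8 through 3.11).
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 11)


def python_is_supported(version_info=sys.version_info):
    """Return True if the major.minor version is within the supported range."""
    major_minor = (version_info[0], version_info[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python version supported:", python_is_supported())
    print("Virtual environment active:", in_virtual_environment())
```

The virtual environment check relies on the difference between `sys.prefix` and `sys.base_prefix`, so it may not detect environments created by other tools, such as conda.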
Install the GX Python library
- Local
- Hosted environment
- GX Cloud
Because GX 1.0 is a Python library, you can install it in a local Python environment and access its functionality through Python scripts.
Installation and setup
- Optional. Activate your virtual environment.
  If you created a virtual environment for your GX Python installation, browse to the folder that contains your virtual environment and run the following command to activate it:
  Terminal input:
      source my_venv/bin/activate
- Ensure you have the latest version of pip:
  Terminal input:
      python -m ensurepip --upgrade
- Install the GX 1.0 library:
  Terminal input:
      pip install great_expectations
- Verify that GX installed successfully with the terminal command:
  Terminal input:
      great_expectations --version
  If GX was successfully installed, the following output appears:
  Terminal output:
      great_expectations, version 1.0.0a4
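The installed version can also be read from Python, which is useful when the terminal command is not on your PATH. This is a minimal sketch that uses the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the distribution name is assumed to match the name passed to pip above.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("great_expectations")
    if version is None:
        print("great_expectations is not installed in this environment")
    else:
        print("great_expectations version:", version)
```

Returning `None` instead of raising an exception lets a setup script choose whether to install the package or stop.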
Hosted environments such as EMR Spark or Databricks clusters do not provide a filesystem on which to install GX. Instead, you must install GX in memory using the Python-style notebooks available on those platforms.
- EMR Spark notebook
- Databricks notebook
Use the information provided here to install GX on an EMR Spark cluster and instantiate a Data Context without a full configuration directory.
Additional prerequisites
- An EMR Spark cluster.
- Access to the EMR Spark notebook.
Installation and setup
- To install GX on your EMR Spark cluster, copy this code snippet into a cell in your EMR Spark notebook and then run it:
  Python:
      sc.install_pypi_package("great_expectations")
- Create an in-code Data Context. See Instantiate an Ephemeral Data Context.
- Copy the Python code at the end of How to instantiate an Ephemeral Data Context into a cell in your EMR Spark notebook, or use the other examples to customize your configuration. The code instantiates and configures a Data Context for an EMR Spark cluster.
- Execute the cell with your Data Context initialization and configuration.
- Run the following command to verify that GX was installed and your in-memory Data Context was instantiated successfully:
  Python:
      context.list_datasources()
Databricks is a web-based platform that automates Spark cluster management.
To avoid configuring external resources, you'll use the Databricks File System (DBFS) for your Metadata Stores and Data Docs store.
DBFS is a distributed file system mounted in a Databricks workspace and available on Databricks clusters. Files on DBFS can be written and read as if they were on a local filesystem by adding the /dbfs/ prefix to the path. It also persists in object storage, so you won’t lose data after terminating a cluster. See the Databricks documentation for best practices, including mounting object stores.
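To illustrate the /dbfs/ prefix rule, the sketch below maps a DBFS location to the local-filesystem form described above. The helper `to_local_dbfs_path` is hypothetical and is not part of GX or Databricks. It assumes the input is a `dbfs:/` URI, a DBFS-absolute path, or a path that already has the /dbfs/ prefix.

```python
DBFS_SCHEME = "dbfs:/"
LOCAL_MOUNT = "/dbfs/"


def to_local_dbfs_path(path):
    """Convert a dbfs:/ URI or DBFS-absolute path to its /dbfs/ local form."""
    if path.startswith(LOCAL_MOUNT):
        return path  # already in local-filesystem form
    if path.startswith(DBFS_SCHEME):
        path = path[len(DBFS_SCHEME):]  # drop the URI scheme
    # Prepend the mount point, avoiding a doubled slash.
    return LOCAL_MOUNT + path.lstrip("/")


if __name__ == "__main__":
    print(to_local_dbfs_path("dbfs:/great_expectations/"))
```

Python file APIs such as `open` can then use the resulting path as they would any local path.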
Additional prerequisites
- A complete Databricks setup, including a running Databricks cluster with an attached notebook
- Access to DBFS
Installation and setup
- Run the following command in your notebook to install GX as a notebook-scoped library:
  Terminal input:
      %pip install great-expectations
  A notebook-scoped library is a custom Python environment that is specific to a notebook. You can also install a library at the cluster or workspace level. See Databricks Libraries.
- Run the following code to import the Python modules and classes you'll use in the following steps:
  Python:
      import great_expectations as gx
      from great_expectations.checkpoint import Checkpoint
      from great_expectations.core.expectation_suite import ExpectationSuite
- Run the following code to specify a /dbfs/ path for your Data Context:
  Python:
      context_root_dir = "/dbfs/great_expectations/"
- Run the following code to instantiate your Data Context:
  Python:
      context = gx.get_context(context_root_dir=context_root_dir)
GX Cloud provides a web interface for using GX to validate your data without creating and running complex Python code. However, GX 1.0 can connect to a GX Cloud account if you want to customize or automate your workflows through Python scripts.
Installation and setup
To deploy a GX Agent, which serves as an intermediary between GX Cloud's interface and your organization's data stores, see Connect GX Cloud. The GX Agent serves all GX Cloud users within your organization. If a GX Agent has already been deployed for your organization, you can use the GX Cloud online application without further installation or setup.
To connect to GX Cloud from a Python script that uses a local installation of GX instead of the GX Agent, see Connect to an existing Data Context.