abacusai.dataset
Module Contents
Classes
Dataset – A dataset reference
- class abacusai.dataset.Dataset(client, datasetId=None, sourceType=None, dataSource=None, createdAt=None, ignoreBefore=None, ephemeral=None, lookbackDays=None, databaseConnectorId=None, databaseConnectorConfig=None, connectorType=None, featureGroupTableName=None, applicationConnectorId=None, applicationConnectorConfig=None, incremental=None, isDocumentset=None, schema={}, refreshSchedules={}, latestDatasetVersion={})
Bases:
abacusai.return_class.AbstractApiClass
A dataset reference
- Parameters:
client (ApiClient) – An authenticated API Client instance
datasetId (str) – The unique identifier of the dataset.
sourceType (str) – The source of the Dataset. EXTERNAL_SERVICE, UPLOAD, or STREAMING.
dataSource (str) – The location of the data. It may be a URI, such as an S3 bucket path, or a database table.
createdAt (str) – The timestamp at which this dataset was created.
ignoreBefore (str) – The timestamp at which all previous events are ignored when training.
ephemeral (bool) – The dataset is ephemeral and not used for training.
lookbackDays (int) – Specific to streaming datasets, this specifies how many days' worth of data to include when generating a snapshot. A value of 0 leaves this selection to the system.
databaseConnectorId (str) – The Database Connector used.
databaseConnectorConfig (dict) – The database connector query used to retrieve data.
connectorType (str) – The type of connector used to get this dataset: FILE or DATABASE.
featureGroupTableName (str) – The table name of the dataset’s feature group.
applicationConnectorId (str) – The Application Connector used.
applicationConnectorConfig (dict) – The application connector query used to retrieve data.
incremental (bool) – Whether the dataset is an incremental dataset.
isDocumentset (bool) – Whether the dataset is a documentset.
latestDatasetVersion (DatasetVersion) – The latest version of this dataset.
schema (DatasetColumn) – List of resolved columns.
refreshSchedules (RefreshSchedule) – List of schedules that determine when the next version of the dataset will be created.
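A minimal sketch of obtaining a `Dataset` reference, assuming an authenticated `ApiClient`, its `describe_dataset` call, and placeholder IDs; instances are returned by the client rather than constructed by hand:

```python
from abacusai import ApiClient

client = ApiClient(api_key="YOUR_API_KEY")            # placeholder API key
dataset = client.describe_dataset("your_dataset_id")  # placeholder dataset ID

# The fields listed above are exposed as snake_case attributes on the instance.
print(dataset.dataset_id, dataset.source_type, dataset.created_at)
```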
- __repr__()
Return repr(self).
- to_dict()
Get a dict representation of the parameters in this class.
- Returns:
The dict value representation of the class parameters
- Return type:
dict
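For example, the dict form can be serialized for logging (a sketch; the API key and dataset ID are placeholders):

```python
import json

from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
print(json.dumps(dataset.to_dict(), indent=2, default=str))  # nested objects may need default=str
```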
- create_version_from_file_connector(location=None, file_format=None, csv_delimiter=None)
Creates a new version of the specified dataset.
- Parameters:
location (str) – A new external URI to import the dataset from. If not specified, the last location will be used.
file_format (str) – The file format to be used. If not specified, the service will try to detect the file format.
csv_delimiter (str) – If the file format is CSV, the delimiter to use.
- Returns:
The new Dataset Version created.
- Return type:
DatasetVersion
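A sketch of creating a new version from a file connector; the S3 URI, API key, and dataset ID are placeholders:

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
version = dataset.create_version_from_file_connector(
    location="s3://my-bucket/exports/latest/",  # placeholder URI; omit to reuse the last location
    file_format="CSV",
    csv_delimiter=",",
)
dataset.wait_for_import()  # block until the new version finishes importing
```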
- create_version_from_database_connector(object_name=None, columns=None, query_arguments=None, sql_query=None)
Creates a new version of the specified dataset.
- Parameters:
object_name (str) – If applicable, the name/id of the object in the service to query. If not specified, the last name will be used.
columns (str) – The columns to query from the external service object. If not specified, the last columns will be used.
query_arguments (str) – Additional query arguments to filter the data. If not specified, the last arguments will be used.
sql_query (str) – The full SQL query to use when fetching data. If present, this parameter will override objectName, columns, and queryArguments.
- Returns:
The new Dataset Version created.
- Return type:
DatasetVersion
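A sketch of refreshing a database-connector dataset with an explicit SQL query (placeholder query and IDs); `sql_query` overrides the object, column, and query-argument settings:

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
version = dataset.create_version_from_database_connector(
    sql_query="SELECT * FROM sales WHERE updated_at > '2023-01-01'",  # placeholder query
)
```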
- create_version_from_application_connector(object_id=None, start_timestamp=None, end_timestamp=None)
Creates a new version of the specified dataset.
- Parameters:
object_id (str) – If applicable, the ID of the object in the service to query. If not specified, the last ID will be used.
start_timestamp (int) – The Unix timestamp of the start of the period that will be queried.
end_timestamp (int) – The Unix timestamp of the end of the period that will be queried.
- Returns:
The new Dataset Version created.
- Return type:
DatasetVersion
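A sketch of pulling a bounded time window through an application connector; the object ID and time window are placeholders:

```python
import time

from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
now = int(time.time())
version = dataset.create_version_from_application_connector(
    object_id="report_123",           # placeholder object ID
    start_timestamp=now - 7 * 86400,  # last seven days
    end_timestamp=now,
)
```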
- create_version_from_upload(file_format=None)
Creates a new version of the specified dataset using a local file upload.
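A sketch of starting an upload-based version; per the client's upload flow, the returned handle is then used to stream the file contents (the details of that handle are outside this section):

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
upload = dataset.create_version_from_upload(file_format="CSV")
# `upload` is a handle for streaming the file body via the client's upload API.
```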
- snapshot_streaming_data()
Snapshots the current data in the streaming dataset for training.
- Parameters:
dataset_id (str) – The unique ID associated with the dataset.
- Returns:
The new Dataset Version created.
- Return type:
DatasetVersion
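A sketch of snapshotting a streaming dataset; the instance method needs no arguments because the dataset ID is taken from the object:

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_streaming_dataset_id")  # placeholder
version = dataset.snapshot_streaming_data()
dataset.wait_for_import()  # wait for the snapshot to materialize
```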
- set_column_data_type(column, data_type)
Set a column’s type in a specified dataset.
- Parameters:
column (str) – The name of the column.
data_type (str) – The type of the data in the column. INTEGER, FLOAT, STRING, DATE, DATETIME, BOOLEAN, LIST, STRUCT. Refer to the [guide on data types](https://api.abacus.ai/app/help/class/DataType) for more information. Note: Some ColumnMappings will restrict the options or explicitly set the DataType.
- Returns:
The dataset and schema after the data_type has been set.
- Return type:
Dataset
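A sketch of overriding a column's data type (placeholder column name):

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
dataset = dataset.set_column_data_type(column="order_total", data_type="FLOAT")  # placeholder column
```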
- set_streaming_retention_policy(retention_hours=None, retention_row_count=None)
Sets the streaming retention policy for the dataset.
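A sketch of bounding streamed data by both age and row count (placeholder values):

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_streaming_dataset_id")  # placeholder
dataset.set_streaming_retention_policy(
    retention_hours=24 * 30,         # keep roughly 30 days of streamed rows
    retention_row_count=10_000_000,  # and cap the total retained rows
)
```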
- get_schema()
Retrieves the column schema of a dataset.
- Parameters:
dataset_id (str) – The dataset whose schema is to be looked up.
- Returns:
List of column schema definitions.
- Return type:
List[DatasetColumn]
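A sketch of inspecting the resolved columns:

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
for column in dataset.get_schema():
    print(column.to_dict())  # each entry describes one resolved column
```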
- refresh()
Calls describe and refreshes the current object’s fields.
- Returns:
The current object.
- Return type:
Dataset
- describe()
Retrieves a full description of the specified dataset, with attributes such as its ID, name, source type, etc.
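A sketch contrasting the two calls: describe() fetches a fresh description, while refresh() folds that data back into the existing object and returns it:

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
fresh = dataset.describe()  # full, up-to-date description of the dataset
dataset.refresh()           # same data applied to this instance's fields
```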
- list_versions(limit=100, start_after_version=None)
Retrieves a list of all dataset versions for the specified dataset.
- Parameters:
limit (int) – The maximum number of dataset versions to return. Defaults to 100.
start_after_version (str) – The ID of the version after which the list starts.
- Returns:
A list of dataset versions.
- Return type:
List[DatasetVersion]
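A sketch of listing recent versions (placeholder IDs):

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
for version in dataset.list_versions(limit=10):
    print(version.to_dict())
```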
- attach_to_project(project_id, dataset_type)
[DEPRECATED] Attaches the dataset to the project.
Use this method to attach a dataset that is already in the organization to another project. The dataset type is required to let the AI engine know what type of schema should be used.
- Parameters:
project_id (str) – The project to attach the dataset to.
dataset_type (str) – The dataset has to be a type that is associated with the use case of your project. Please see the [Use Case Documentation](https://api.abacus.ai/app/help/useCases) for the datasetTypes that are supported per use case.
- Returns:
An array of column descriptions.
- Return type:
List[Schema]
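A sketch of the deprecated attach path, shown only for completeness; the project ID and dataset type are placeholders that must match your project's use case:

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
schema = dataset.attach_to_project(
    project_id="your_project_id",  # placeholder project ID
    dataset_type="SALES_HISTORY",  # placeholder; see the use case documentation
)
```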
- remove_from_project(project_id)
[DEPRECATED] Removes a dataset from a project.
- Parameters:
project_id (str) – The unique ID associated with the project.
- delete()
Deletes the specified dataset from the organization.
The dataset cannot be deleted if it is currently attached to a project.
- Parameters:
dataset_id (str) – The dataset to delete.
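A sketch of detaching and then deleting a dataset; per the note above, deletion fails while the dataset is still attached to a project (placeholder IDs):

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
dataset.remove_from_project(project_id="your_project_id")  # deprecated detach call
dataset.delete()  # allowed only once no project still uses the dataset
```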
- wait_for_import(timeout=900)
A waiting call until the dataset is imported.
- Parameters:
timeout (int, optional) – The waiting time given to the call to finish. If the call does not finish within the allocated time, it is considered timed out.
- wait_for_inspection(timeout=None)
A waiting call until the dataset is completely inspected.
- Parameters:
timeout (int, optional) – The waiting time given to the call to finish. If the call does not finish within the allocated time, it is considered timed out.
- get_status()
Gets the status of the latest dataset version.
- Returns:
A string describing the status of a dataset (importing, inspecting, complete, etc.).
- Return type:
str
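A sketch of blocking on import and then checking the latest version's status; the timeout is in seconds:

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
dataset.wait_for_import(timeout=1800)  # wait up to 30 minutes for the import
dataset.wait_for_inspection()          # then wait for schema inspection to finish
print(dataset.get_status())            # status string such as importing, inspecting, or complete
```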
- describe_feature_group()
Gets the feature group attached to the dataset.
- Returns:
A feature group object.
- Return type:
FeatureGroup
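A sketch of jumping from the dataset to its backing feature group:

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
feature_group = dataset.describe_feature_group()
print(feature_group.to_dict())  # the feature group attached to this dataset
```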
- create_refresh_policy(cron)
Creates a refresh policy for the dataset.
- Parameters:
cron (str) – A cron style string to set the refresh time.
- Returns:
The refresh policy object.
- Return type:
RefreshPolicy
- list_refresh_policies()
Lists the refresh policies for the dataset.
- Returns:
A list of refresh policy objects.
- Return type:
List[RefreshPolicy]
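A sketch of scheduling a nightly refresh and then listing the policies on the dataset; the cron string is a placeholder:

```python
from abacusai import ApiClient

dataset = ApiClient(api_key="YOUR_API_KEY").describe_dataset("your_dataset_id")  # placeholders
policy = dataset.create_refresh_policy("0 4 * * *")  # placeholder cron: every day at 04:00
for p in dataset.list_refresh_policies():
    print(p.to_dict())
```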