Data Cloud data is currently available in the following formats: CSV, JSON, and Parquet. Files are encoded as UTF-8.
The CSV files are RFC 4180-compliant, so they should integrate easily with most applications that support the CSV format.
<aside> 💡 We’re open to publishing data in other formats. Please reach out if the formats listed above do not meet your use case.
</aside>
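Because the CSV files are RFC 4180-compliant and UTF-8 encoded, they parse cleanly with standard libraries. A minimal Python sketch; the column names here are illustrative, not an actual Data Cloud schema:

```python
import csv
import io

# Hypothetical sample resembling a Data Cloud CSV delivery; the column
# names are illustrative, not the actual sales_estimates_weekly schema.
sample = (
    "marketplace,asin,estimated_units\n"
    'US,B000000001,"1,200"\n'  # RFC 4180: comma inside a quoted field
    "US,B000000002,850\n"
)

def parse_data_cloud_csv(text):
    """Parse an RFC 4180-compliant, UTF-8 CSV into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

rows = parse_data_cloud_csv(sample)
print(rows[0]["estimated_units"])  # 1,200 -- the quoted comma survives
```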
In analytical database design (or dimensional database design), data is modeled to optimize for data retrieval and analysis. Data is organized into facts (measurements, like sales, clicks, orders, etc.) and dimensions (context descriptors, like country, customer, seller, etc.). Data Cloud takes a similar approach for organizing data into tables. Most Data Cloud tables can be classified as either a fact table or a dimension table.
With delivery to an S3 bucket, most data sets are published as two separate tables. For example, for the sales_estimates_weekly data set you’ll see both a sales_estimates_weekly table (which contains all historical data partitioned by date) and sales_estimates_weekly_latest table (which contains only the most recent delivery). The *_latest table is continually updated with the latest available time series or dimension data and will be overwritten during each data delivery cycle.
The *_latest table will normally contain only the latest delivery of data (sometimes broken into multiple files). For instance, the sales_estimates_weekly data is a weekly grain of data, so each delivery to the *_latest folder will be for only the most recent week.
<aside>
⚠️ Note: if an anomaly is detected, multiple weeks of data may be delivered to the *_latest folder. This will require an upsert to amend previously delivered records that contained incorrect data. If this situation occurs, you will be notified prior to delivery. It is best practice to build a system that detects multiple periods of data and upserts automatically.
</aside>
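An automated upsert along those lines can be sketched in Python. Here the warehouse is stood in for by a dict keyed on a (week, marketplace) tuple; the row fields and the key choice are illustrative assumptions, not the actual schema:

```python
def upsert_latest(warehouse, latest_rows, key_fields=("week", "marketplace")):
    """Upsert a *_latest delivery into a warehouse, here stood in for by a
    dict keyed on (week, marketplace). Field names are illustrative."""
    periods = sorted({row[key_fields[0]] for row in latest_rows})
    if len(periods) > 1:
        # More than one period in a single delivery signals a restatement;
        # a real pipeline would also alert operators here.
        print(f"restatement detected: {len(periods)} periods delivered")
    for row in latest_rows:
        warehouse[tuple(row[f] for f in key_fields)] = row  # insert or overwrite
    return periods

# A delivery containing two weeks -- i.e., a restatement of the older week.
warehouse = {}
delivery = [
    {"week": "2024-01-01", "marketplace": "US", "units": 90},
    {"week": "2024-01-08", "marketplace": "US", "units": 100},
]
periods = upsert_latest(warehouse, delivery)
```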
Generally, customers will reference the *_latest tables for their analysis. However, this design also allows customers to build up slowly changing tables in their own data warehouse systems using historical snapshots if needed. This format allows recurring data to be ingested without changing the path name.
Data Cloud data is partitioned using “hive style partitioning”. This means data is organized with paths that contain key value pairs like:
```
version=2/format=csv/table=sales_estimates_weekly/marketplace=US/...
```
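Hive-style paths can be decoded by splitting on `/` and `=`. A small Python helper; the file name at the end of the example path is made up for illustration:

```python
def parse_partitions(path):
    """Extract hive-style key=value partition pairs from an object key.
    Segments without an '=' (such as the file name) are skipped."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

# The trailing file name is a hypothetical example.
parts = parse_partitions(
    "version=2/format=csv/table=sales_estimates_weekly/marketplace=US/part-0001.csv"
)
```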
Data Cloud partitions start with a version. The version partition is used if Jungle Scout ever needs to introduce breaking changes to Data Cloud data without impacting customers. We don't anticipate this version changing frequently.
The table partition is used to represent the “table” you can expect within a folder hierarchy. You can expect the data schema to be consistent for all files within a given table partition.
Fact tables will have time-based partitions (e.g., year, month, week, and/or day) that represent the date the facts were recorded or generated. It’s important for automated data ingestion processes to consider the restatement period when continually integrating Data Cloud data into customer-managed systems. While we work to limit restatements, there are instances where they are necessary; these will be communicated to customers ahead of time.
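One way to account for a restatement period is to always re-read the last few time partitions rather than only the newest one. A sketch, assuming a weekly grain and an illustrative four-week default window (the actual window for a data set is whatever is communicated to you):

```python
from datetime import date, timedelta

def partitions_to_refresh(today, restatement_weeks=4):
    """Week-start dates an ingestion run should re-read, newest first.
    restatement_weeks=4 is an illustrative window, not a Data Cloud
    guarantee -- use the period communicated for your data set."""
    monday = today - timedelta(days=today.weekday())
    return [monday - timedelta(weeks=w) for w in range(restatement_weeks)]

# Re-read the current week and the one before it.
weeks = partitions_to_refresh(date(2024, 1, 17), restatement_weeks=2)
```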
Data for a given table may be broken into multiple files. This is intentional and allows for data to be ingested in parallel, which is critical for tables with a large amount of data.
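Because a table’s data may arrive as several files, ingestion can fan out across them. A minimal Python sketch using a thread pool; `load_file` is a placeholder for the real download-and-parse step:

```python
from concurrent.futures import ThreadPoolExecutor

def load_file(path):
    # Placeholder: a real implementation would download the S3 object
    # and parse it (CSV, JSON, or Parquet).
    return f"loaded {path}"

def load_table(paths, max_workers=8):
    """Load a table's part files in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_file, paths))

results = load_table(["part-0001.csv", "part-0002.csv"], max_workers=2)
```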