
Mapping structured CSV file data to a new data table

Workspaces attempts to interpret CSV files automatically, determining details such as the character encoding, the column delimiter, the text qualifier, and the table and column names from the data in the CSV file. Sometimes the values for these properties fall outside the range that the platform can determine automatically; in such cases, the details must be supplied in a table definition file (TDF) sent along with the CSV file.

If supplied, the table definition file must be uploaded before the associated CSV file, and have the same name as the CSV file, but with the extension ".xml". For example, if you supply a data file called myresearchdata.csv, the corresponding TDF must be called myresearchdata.xml.

 

The format of the table definition file

Table definition files are XML documents containing the following elements:

<?xml version="1.0" encoding="utf-8"?>
<TableDefinition TableName='the filename or what the user specifies' Action='create or append'>
  <Columns>
    <Column Name="column-name" Type="##PostgreSQL data-type##" />
    ...
  </Columns>
  <Format
    Delimiter='column-delimiter'
    TextQualifier='text-qualifier'
    NullQualifier='replace these values with null'
    Encoding='UTF-8'
    Header='first line of csv is the header'
    HeaderCase='change the column-header casing'
    DateFormat='the format of all dates in the csv'
  />
  <Success RemovefromUpload='delete source files on SFTP'/>
  <Fail RemovefromUpload='delete source files on SFTP'/>
</TableDefinition>

Please note that:

  • A PostgreSQL data type definition cannot contain column constraints or default values. See http://www.postgresql.org/docs/8.2/static/datatype.html for more details about PostgreSQL data types. Note that the platform is built on the Greenplum database, which is in turn based on PostgreSQL 8.2.
  • Where Column elements are supplied, their number must match the number of columns in the accompanying CSV file.
  • The file is case-sensitive, so ensure that all elements and attributes follow the casing set out in the templates outlined here.
  • A limited set of XML Named Character References are supported for the TDF.
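Putting this together, a complete TDF for a hypothetical file myresearchdata.csv might look like the following. The table name, column names, types, and attribute values here are illustrative assumptions for this example, not platform defaults:

```xml
<?xml version="1.0" encoding="utf-8"?>
<TableDefinition TableName='my_research_data' Action='create'>
  <Columns>
    <Column Name="patient_id" Type="integer" />
    <Column Name="visit_date" Type="date" />
    <Column Name="weight_kg" Type="numeric(5,1)" />
  </Columns>
  <Format
    Delimiter=','
    TextQualifier='&quot;'
    NullQualifier='NA'
    Encoding='UTF-8'
    Header='true'
    DateFormat='YYYY-MM-DD'
  />
  <Success RemovefromUpload='true'/>
  <Fail RemovefromUpload='false'/>
</TableDefinition>
```

Remember that this file would be named myresearchdata.xml and uploaded before myresearchdata.csv.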
 

The table definition element

TableDefinition

Providing the TableDefinition element is mandatory; any missing or blank attributes will be handled as follows:

TableName

The name of the destination table into which the accompanying data file will be loaded. If not supplied or blank, the file name is used.

Action

Specifies whether the data is loaded into a brand-new table (create) or appended to the end of an existing table (append). If not supplied or blank, "create" is assumed.

For create:

  • If a table of the same name already exists, it is renamed with a timestamp before the new table is created.

For append:

  • If a table with the given name does not exist, a new one is created.
  • If the data being appended is narrower than the destination table, all missing values are set to null.
  • If the data being appended is wider than the destination table, the surplus columns are silently discarded.
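For example, a minimal TDF that appends rows to an existing table might look like this (the table and column names are invented for illustration; whether Columns may be omitted on append is not stated here, so they are included):

```xml
<?xml version="1.0" encoding="utf-8"?>
<TableDefinition TableName='patient_observations' Action='append'>
  <Columns>
    <Column Name="patient_id" Type="integer" />
    <Column Name="observation_date" Type="date" />
  </Columns>
</TableDefinition>
```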

Columns

The Columns element should contain one Column element per column in the data to be uploaded.

Column

Name

The name to give the column in the database table.

Description

Field level metadata providing a description of the data in this column. This will appear alongside the column name in the dataset preview in the Workspace.

Type

The PostgreSQL type the data in this column should be stored as. See http://www.postgresql.org/docs/8.2/static/datatype.html for the available types.

 

Data transformation during upload processes

When your CSV file is loaded into the platform, it is stored in Workspaces as a table. Note that when a file is loaded without a table definition file, all special characters (. ? , " $ % ^ ( ! # ) in the file name are transformed to underscores (_) and capital letters become lowercase; for example, if you upload a file called Aridhia-1.csv, the table is saved as aridhia_1.

This also applies to table column headings; for example, Column.1 would change to column_1.

Here is the full list of changes made by the platform when uploading via SFTP:

  • Within the file name, the following special characters are replaced by an underscore (_):
    – caret (^)
    – full stop (.)
    – single quote (')
    – double quote (")
    – space ( )
    – dash (-)
  • An empty value with no quotations is translated to NULL.
  • An empty value with quotations is translated to an empty string.
  • All columns in the table are nullable.
  • Missing fields at the end of rows are assumed to be NULL.
  • If a column name is not supplied, the column is given a unique name: column1, column2, and so on.
  • Column names:
    – Will be trimmed to remove any leading or trailing space characters.
    – Will be trimmed to 60 characters.
    – Will be made unique by appending a number to the name.
  • Within a column name, the following special characters are replaced by an underscore (_):
    – caret (^)
    – full stop (.)
    – single quote (')
    – double quote (")
    – space ( )
  • The platform does not support the data type "timestamp with timezone" – any field of this type is treated as a "timestamp without time zone".
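The file-name rules above can be sketched in code. The function below is a minimal illustration of the documented replacements, not the platform's actual implementation: the exact order of operations and the full character set are assumptions, and the column-name rules differ slightly (no dash replacement, plus number-suffix uniquification, which is not implemented here).

```python
import re

def sanitize_name(name, max_len=60):
    """Illustrative sketch of the file-name transformations described above."""
    name = name.strip()                      # trim leading/trailing spaces
    name = re.sub(r"[\^.'\" -]", "_", name)  # listed special characters -> underscore
    name = name.lower()                      # capital letters become lowercase
    return name[:max_len]                    # names are trimmed to 60 characters

print(sanitize_name("Aridhia-1"))  # aridhia_1
print(sanitize_name("Column.1"))   # column_1
```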

Alternatively, a TDF can be used to specify the table name and column headers; see The table definition element section above.

Updated on June 19, 2019
