Product Documentation - CSV Adapter
Introduction
This version 5 CSV adapter is designed to run in the new eMite Adapter Framework V5. It is a rewrite of the version 4 CSV adapter to simplify the whole process of onboarding CSV data into eMite. The previous version of the adapter could only onboard data into the SQL Server database (Data warehouse) using the Data warehouse module in eMite. This dependency had put in a couple of hard limitations and complexity for customers who wished to use this adapter to send data from their local premise to the cloud hosted instance of eMite.
The whole processed required:
Upload the CSV file to the cloud instance regularly as the CSV file became available via FTP.
The CSV files was processed into a cloud data warehouse.
Customers were then required to manually create a Fact Query over the data loaded into the data warehouse.
Schedule the running of the Fact Query that onboard data into the eMite indexes.
Create dashboards via the eMite Dashboard once the data was available in the indexes.
Steps 2 to 4 happened in the cloud while the customer did not have access to the logs or status of the data onboarding making it a very difficult process as constant support was required from eMite to figure out data isues, etc.
The new CSV adapter utilizes the cloud onboarding features of the new Adapter Framework (V5) to load data directly from local premise to cloud. It can also directly load CSV data into a data warehouse if need in the cloud but usually this is not required.
Below is the simplified process:
Install the new Adapter Framework (V5) on local premise and configure the CSV adapter to be ran.
Manually schedule adapter using Windows task Scheduler as required.
Create dashboards via the eMite Dashboard UI once the data has been onboarded into the indexes.
The above steps greatly simplify the CSV data onboarding and gives the customers access to the log files to figure out issues in the data during the onboarding process.
This article will guide you on how to setup and run the adapter from the local premise.
Prerequisites
To configure and run the CSV adapter, the following prerequisite is required:
Install the new Adapter Framework (v5) on premise
You can skip the above process if full eMite product has been installed as the Adapter Framework (v5) is already part of the full installation.
Deployment / Installation
To deploy the CSV Adapter:
Get the latest dlls for the CSV adapter.
Create a new adapter folder to hold the adapter dlls at the following location: [EmiteInstallationDirectory]\eMite\[EmiteVersion]\apps\Adapters\Pipeline\AddIns. This folder contains all the version 5 adapters contained in separate named folders. The name of the folder can be freely assigned, but for consistency, this should be the name of the adapter so that it can be easily recognized, e.g. CSV.
Copy all the CSV adapter dlls into the newly created directory
Go to [EmiteInstallationDirectory]\eMite\[EmiteVersion]\apps\Adapters and run the Z_ConfigurationLoader.bat. This will load the CSV adapter configurations into the eMite database via the configuration web service.
If this is being deployed on local premise, then this step will send all the configurations to the eMite instance in the cloud.Open the Schema admin to verify that the adapter template has been installed.
Configuration
Once the adapter has been deployed and verified, you will see the following configuration items in the adapter template.
The configuration items are described below as two categories
Adapter Configurations
CSVDropFolder - Set it to the folder location that contains all the CSV files.
ArchiveFiles - Set it to either true or false. This config item instructs the adapter to archive the CSV file once it has processed it. Note: Currently, there is no mechanism in place to age archive files.
DateControllerField - This is a DateTime field in the CSV that can be used as a savepoint column for incremental processing of data. If a value is not defined, a column called "DateController" is created automatically.
PrimaryKeyFields - A comma separated list of field in the CSV that uniquely defines a record in the file. If a value is not defined for this config item, a column named "PrimaryKey" is created automatically.
Delimiter - Default value is set to a comma (“,”). This config item tells the CSV adapter the character that is used to separate the fields in the CSV file.
FilterPattern - Use this field to specify a file name filter. The filter will be used to get only the CSV files that match this filter. E.g. "*test*.CSV". This filter will only bring back files that have the word "test" in it.
StorageType - Possible values: ElasticSearch/SqlServer. This config item specifies where to onboard the processed data. ElasticSearch option will directly onboard all the data into the eMite Indexes and SqlServer will load the data into the eMite database.
IndexGroup - Specifies the index group that the new index would belong to (if using ElasticSearch for StorageType config item). If the named IndexGroup does not exist, a new group with the specified name will be created.
IndexName - Specifies the name of the Index that would be created (if using ElasticSearch for StorageType config item). If SqlServer option is used for StorageType, then this config item signifies the TableName into which data would be loaded.
SchemaVerified - Possible values: true/false. This config item tells the adapter that CSV fields schema has been verified and it is ok to onboard the data to the chosen destination, i.e. ElasticSearch or SqlServer.
CSVMetaData - This field holds the discovered schema for the user to verify. The user can change the data type of fields using this config item. The schema info is in JSON format. Please exercise caution when editing this field.
DaysToKeepArchiveFiles - Numeric value which controls the number of days to keep archived CSV files. The default value is 14 days.
Datawarehouse - This is the name of the connection defined in the RTBI UI that will be used as the destination database for onboarding data into SQL Server. This configuration item is only valid when the StorageType = SqlServer.
DownloadUrls - The adapter can download CSV file using a URL. This configuration item allows the user to define multiple URLs separated by comma in the following format:
<URL>|<FileType>|<NewFileName>,<Url>|<FileType>|<NewFileName> e.g. https://url.com/test.CSV|CSV|NewFileName.CSVEnableDownload - This configuration item controls if the adapter instance can download CSV files from the URLs defined. Possible values: true/false
EnableAsyncDownload - This configuration item allows the download of the CSV files defined using the "DownloadUrls" config item in background while the CSV adapter is processing a file. Enabling Async download will run the download in parallel but will not end the adapter instance until the download has completed.
FirstRowIsHeader - This configuration item can be set to either true or false to indicate if the CSV file has a header row. The default value is true.
Adapter System Configurations
AdapterVersion - Leave this to the default value of "V5". This configuration item tells the eMite scheduler that a version 5 adapter is being ran.
Min Log Level - Possible values: info/warn/error/debug/trace. Controls the level of logging for the adapter instance.
CSVMetaDataFile - Column is not in use.
ElasticBulkSize - The default of this field is 5000 and should not be changed. This config item is used internally to configure the batch size for processing data into elastic search.
ElasticParallelThreads - The default value of this field is 2 and should not be changed.
ElasticUrl - The default value of this field is: http://localhost:9200 and should not be changed.
StartDateUTC - Use this field to specify the start date from which data should be onboarded.
Amazon S3 Configurations
AwsRegion - The associated AWS Region of the AWS S3 bucket e.g.us-west-2
AwsClientId - The access key to be used to sign requests made to AWS S3
AwsClientSecret - Secret access key which is used together with the AwsClientId to authenticate requests made to AWS S3
AwsAssumeRole - Delegated IAM role which has access to the associated AWS instance.
AwsExternalId - The Id used as an additional authentication for assuming a role when making requests to AWS.
AwsProxyPort - Proxy port number if a network proxy is in use.
AwsProxyHost - Proxy host name if a network proxy is in use.
AwsS3BucketName - The S3 bucket name where CSV files would be accessed from. This can contain sub folders, e.g. syd-cloudbackup/sts-testing
CSVFileName - The filename of the CSV which will be downloaded from S3. If this config is left as empty, all CSV files will be downloaded from the S3 bucket.
EnableS3Download - Enables/Disables the S3 download component
SFTP Configurations
EnableSFTPDownload - Enables/disables the SFTP component
SFTPHostName - Sftp server name
SFTPPort - Sftp port to use if specified. By default, SFTP uses port 22
SFTPFileName - CSV filename which will be downloaded from SFTP
SFTPDownloadFileName - The full path where the CSV will be downloaded to. If only the filename was specified, the path to be used would be CSVDropFolder.
SFTPUserName - Username to be used for connecting to the SFTP server
SFTPPassword - Password for connecting to the SFTP server
SFTPKeyFile - Key file used for authenticating SFTP download
SFTPKeyFilePassPhrase - If the SFTPKeyFile is encrypted, this pass phrase is used to decrypt the SFTPKeyFile.
DateTimeFormat
CSV adapter has new changes to the Datetime formatting now. Every CSV file can have a custom datetime format, but backend was not able to handle many formats before.
Now, the adapter is customized to parse different dateformats by using a dateformat text file in the below path.
This file is also added to the build so it will be always available with the adapter.
If a dateformat is not parsed properly and is showing only minimum dates in the datecolumn, then it means it is a formatting issue.
Hence, if any issue comes up with dateformat parsing, then please contact the developer so that they can debug and see the date format that is being passed in the backend - as Excel file shows as month before date but might pass the reverse in the backend.
This happens commonly, and hence only the developer can guess the correct format.
Now the good thing is, every format can be easily parsed by adding the new dateformat in the file unlike before.
The following date formats are supported by CSV adapter. Any new format can be determined by the developer and appended to above file like below to help new format parsed correctly.
|
|
|
|
Additional Features
Amazon S3
The adapter supports downloading CSV files from Amazon S3. To download files from S3, ensure that AWS credentials are fully supplied. The adapter supports different ways of AWS authentication like direct authentication using keys, role assumption, and EC2 attached role authentication.
This feature supports downloading to S3 buckets having subfolders. To download from subfolders, just specify the complete folder path in the AwsS3BucketName config. e.g. syd-cloudbackup/sts-testing.
This feature can also download a specific file or multiple files. To download a specific file, specify the filename in the CSVFileName config. If the CSVFileName is left as null/empty, all CSV files will be downloaded from the S3 bucket.
SFTP
The adapter also supports downloading CSV files from an SFTP server. The adapter downloads the specified CSV file via the SFTPFileName config from the specified server via the SFTPHostName and SFTPPort configs. The CSV file is downloaded to path specified in the SFTPDownloadFileName config. If full path is not specified, then the CSV file is downloaded in the CSVDropFolder config by default. The adapter supports username/password authentication of key file based authentication for SFTP.
Limitations
CSV files with data to be onboarded must include a row with field (column) names, by which the data will be known in eMite
Custom format for numeric values is currently not supported. Numeric values must be of “NNNN.NN” format, i.e. dot is used to define the decimal point and thousand separators are not allowed.
Timestamp format is currently only supported for DateControllerField and is to be specified as part of DateControllerField configuration separated by a pipe, e.g. DataTimeStamp|tsFormat, where tsFormat is the standard .Net date formats:
e.g. The chosen date controller field is called "DOB" and has a value as follows: 1960-03-08T15:30:00.000Z. The following format can be used in the DateControllerField configuration, e.g. DOB|yyyy-MM-ddTHH:mm:ss.fffZ
Timestamp onboarded as DateControllerField will not be converted/saved as UTC
Notes
Please note the general steps in using the new CSV adapter once it has been configured.
The configuration must be active (i.e. 'Active' check box must be checked) for the CSVMetaData to be populated
Run the CSV Adapter Instance. When the config item SchemaVerified is set to “false” and CSVMetaData has no data, the adapter will process the CSV files and do a discovery of columns. It will populate the CSVMetaData field with the discovered schema as shown below. Note: The configuration must be active. I.e. Active checkbox must be checked for the CSVMetaData to be populated.
Examine each discovered field and verify that they are correct.
Once the meta data has been verified to be correct set the SchemaVerified config item to true.
Schedule the adapter instance to run as regularly as desired. Each run after this would process the data and onboarded the data as configured.
Logs
All version 5 adapter logs can be found at the following location: [EmiteInstallationDirectory]\eMite\[EmiteVersion]\logs\Adapters\[AdapterName]\[InstanceName] as shown below.