File
Use ONLY the File connector included in the edge-connector for the functions listed under Edge-Connector specific functions. The File connector under Data Connectors in the App Builder does NOT store files permanently and should not be used for storing and editing files.
The File connector can read specific file types, write .csv files, and list folder contents.
When used from an edge-connector, it can also move, copy, or delete files.
To access and manage files on your own device, first deploy an edge-connector (see the edge-connector page), then return here.
There are two ways to specify the file to read. When using an edge-connector, insert the path relative to the root folder into the input box. When using the function together with the file browser of the Heisenware platform, drag and drop files from the file browser at the bottom left into the path input box.
The `readFile` function combines the other read functions and currently supports the `.csv`, `.xlsx`, `.txt`, `.pdf` and `.xml` file types.
Parameters can be added in object format (`parameter: value`). For the possible parameters of each file type, see below.
To read a character-separated values (CSV) file, use the `readCsv` function and insert the path into the input box. Parameters can be added in object format in the second box.
See https://www.npmjs.com/package/csvtojson#parameters for more parameter options.
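To build some intuition for what the `delimiter`, `quote` and `trim` parameters control, the following sketch parses a small CSV text with Python's standard `csv` module. It is only an illustration of the concepts: the connector itself relies on csvtojson, and the helper name `parse_csv_text` and the sample data are assumptions made for this example.

```python
import csv
import io


def parse_csv_text(text, delimiter=",", quote='"', trim=True):
    """Parse CSV text into one dictionary per data row (illustrative only)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quote)
    rows = list(reader)
    header, data = rows[0], rows[1:]

    def clean(value):
        # Mimics the idea of "trim": drop spaces surrounding the column content.
        return value.strip() if trim else value

    return [{clean(h): clean(v) for h, v in zip(header, row)} for row in data]


# The quoted field contains the delimiter and is therefore not split.
sample = 'name;greeting\nAda;"hello; world"\nBob;  hi  \n'
records = parse_csv_text(sample, delimiter=";")
print(records)
```

With `trim=False`, the second record would keep the spaces around `hi`.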
Use the `readXlsx` function to ingest Excel files.
Output Example:
Use the following parameters to change how the sheets are interpreted.
It is also possible to nest most parameters for different options per sheet. See https://github.com/DiegoZoracKy/convert-excel-to-json for all parameter options and examples.
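The effect of a `columnToKey` mapping can be pictured with a short sketch. It assumes that each sheet row is available as a dictionary keyed by column letter, which is a simplification chosen for this example and not necessarily how convert-excel-to-json represents rows internally. The function name `apply_column_to_key` and the sample data are hypothetical.

```python
def apply_column_to_key(rows, column_to_key, header_row=None):
    """Rename column letters to output keys; unmapped columns are omitted."""
    wildcard = column_to_key.get("*")
    records = []
    for row in rows:
        record = {}
        for column, value in row.items():
            template = column_to_key.get(column, wildcard)
            if template is None:
                continue  # column not mentioned in the mapping: left out
            if template == "{{columnHeader}}":
                if header_row is None:
                    raise ValueError("a header row is needed for {{columnHeader}}")
                key = header_row[column]  # take the name from the header row
            else:
                key = template
            record[key] = value
        records.append(record)
    return records


header = {"A": "id", "B": "firstName", "C": "city"}
rows = [
    {"A": 1, "B": "Ada", "C": "London"},
    {"A": 2, "B": "Bob", "C": "Paris"},
]

# Static mapping: column C is not listed, so it does not appear in the output.
static = apply_column_to_key(rows, {"A": "id", "B": "firstName"})
# Dynamic mapping: every column is named after its header cell.
dynamic = apply_column_to_key(rows, {"*": "{{columnHeader}}"}, header_row=header)
print(static)
print(dynamic)
```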
The `readDocx` function uses textract in the background. Textract is a module that extracts mainly text from many types of files. From Word documents it extracts the text, including headers, footers and hyperlinks.
For more options visit https://www.npmjs.com/package/textract#configuration.
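One of the textract options, `preserveOnlyMultipleLineBreaks`, removes single line breaks while keeping multiple ones. The following sketch shows that idea with a regular expression. It is an approximation written for illustration, not textract's own code, and the function name `collapse_single_line_breaks` is hypothetical.

```python
import re


def collapse_single_line_breaks(text):
    """Replace lone line breaks with a space; keep runs of two or more."""
    # A line break counts as "single" if no other line break is next to it.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


sample = "This sentence was\nwrapped by the extractor.\n\nA new paragraph."
cleaned = collapse_single_line_breaks(sample)
print(cleaned)
```

The wrapped sentence is joined into one line, while the blank line that separates the two paragraphs survives.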
The following functions rely on the same textract component, but different parameters may be sensible for different file types.
The `readPptx` function ingests text from PowerPoint files and also uses textract. For parameter options see Read a Word file.
The `readHtml` function reads HTML files in .html and .htm format. It also uses textract, so it has the same parameter options as the function for Word files.
Additionally, you can use the parameter `includeAltText`: when extracting HTML, it decides whether to include the alternative text that is rendered if an object fails to load, such as a short description of a picture. The default is `false`.
This function can only extract content from the uploaded HTML file itself. It cannot extract content from web pages that are spread over multiple .htm or .html files. Files using frames cannot be extracted at all.
Text files can be read with the `readTxt` function, which is also textract based (see Word file).
The `readMd` function is also textract based (see Word file) and can be used to ingest Markdown files.
PDFs can be read with the `readPdf` function, which is textract based like the Word function. The extracted information includes some metadata and the text inside the PDF file.
In addition to the general textract parameters, PDF functions accept `pdftotextOptions`. Options entered here must be nested under `pdftotextOptions` as the main key, for example `pdftotextOptions: {ownerPassword: 123}`.
`pdftotextOptions` is a proxy options object passed to pdf-text-extract, the library textract uses for PDF extraction. Options include `ownerPassword` and `userPassword` for extracting text from password-protected PDFs.
Textract changes the pdf-text-extract `layout` default so that it uses `layout: raw` instead of `layout: layout`. Do not modify this without understanding what problems might arise. See this GH issue for why textract overrides that library's default.
To read Extensible Markup Language (XML) files, use the `readXml` function. It is also textract based and therefore offers the same parameters as the Word function.
You can write `.csv` files with the `writeCsv` function. Insert or link a JSON string in the first input box and specify a path and filename for the file to be saved to. The values are always delimited by a comma.
Parameters from the `readCsv` function can currently NOT be used for writing `.csv` files.
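To show what kind of JSON string maps naturally onto a comma-delimited file, here is a minimal sketch using Python's standard library. It assumes the JSON is an array of flat objects. How the connector itself orders columns or handles nested values is not documented here, so those details are assumptions of this example, as is the function name `json_to_csv_text`.

```python
import csv
import io
import json


def json_to_csv_text(json_string):
    """Convert a JSON array of flat objects into comma-delimited CSV text."""
    records = json.loads(json_string)

    # Collect column names in order of first appearance.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty values
    return buffer.getvalue()


data = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob, Jr."}]'
csv_text = json_to_csv_text(data)
print(csv_text)
```

Because the delimiter is a comma, a value that itself contains a comma (`Bob, Jr.`) is wrapped in quotes in the output.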
Use the `moveFile` function to move a file from one location to another. Insert the old path, including the file name, in the first input box and the new path, also including the file name, in the second input box. A file can only be moved into another folder if that folder already exists.
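The rule that the target folder must already exist can be sketched as follows. This is a local illustration with Python's standard library inside a temporary directory that stands in for the connector's root folder. The helper `move_file` is hypothetical and does not call the connector.

```python
import shutil
import tempfile
from pathlib import Path


def move_file(root: Path, old_path: str, new_path: str) -> Path:
    """Move root/old_path to root/new_path; the target folder must exist."""
    source = root / old_path
    target = root / new_path
    if not target.parent.is_dir():
        raise FileNotFoundError(f"Target folder does not exist: {target.parent}")
    shutil.move(str(source), str(target))
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "report.csv").write_text("id,name\n1,Ada\n")

    # First attempt: the folder "archive" is missing, so the move is refused.
    try:
        move_file(root, "report.csv", "archive/report.csv")
        failed_without_folder = False
    except FileNotFoundError:
        failed_without_folder = True

    # Creating the folder first (comparable to createFolder) makes it work.
    (root / "archive").mkdir()
    moved = move_file(root, "report.csv", "archive/report.csv")
    moved_ok = moved.is_file() and not (root / "report.csv").exists()

print(failed_without_folder, moved_ok)
```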
With the `copyFile` function, you can duplicate a file and place the copy in a new location.
To delete a file, use the `deleteFile` function.
Files are deleted immediately and cannot be recovered!
With the `createFolder` function, you can create a new subfolder inside an existing folder.
Delete subfolders with the `deleteFolder` function.
Folders are deleted completely, including all of their content, even if they are not empty. They cannot be recovered afterwards!
To show the content of a folder and all its subfolders, insert the path to the folder in the input field.
The path the file connector operates in is the starting point for all path input fields. To see the contents of the root folder and all subfolders, insert a `.` into the path input box.
To go up a level in the folder structure, use `..` at the beginning of the input field. See here for more information about navigating in the path input box.
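How `.` and `..` behave relative to a starting folder can be illustrated with ordinary POSIX path rules. The root folder `/data/files` and the helper `resolve` are hypothetical, and the sketch assumes the connector follows the usual path semantics.

```python
import posixpath

ROOT = "/data/files"  # hypothetical folder the file connector operates in


def resolve(path_input, root=ROOT):
    """Resolve a relative path input against the root folder."""
    return posixpath.normpath(posixpath.join(root, path_input))


examples = {p: resolve(p) for p in [".", "reports/2024.csv", "../shared/notes.txt"]}
for path_input, resolved in examples.items():
    print(f"{path_input!r} -> {resolved}")
```

Here `.` resolves to the root folder itself, a plain relative path stays below it, and a leading `..` steps one level up before descending into `shared`.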
Parameters for `readCsv`:

Parameter | Default | Description |
---|---|---|
`delimiter` | `,` | Delimiter used for separating columns. Use `"auto"` if the delimiter is unknown in advance. Use an array to give a list of potential delimiters, e.g. `[",", "\|", "$"]`. |
`noheader` | `false` | Indicates that the CSV data has no header row and that the first row is a data row. |
`checkColumn` | `false` | Checks whether the number of columns in a row matches the number of headers. If it does not, a "mismatched_column" error is emitted. |
`checkType` | `false` | Turns type interpretation on or off. |
`quote` | `"` | If a column contains the delimiter, the quote character can be used to surround the column content, e.g. "hello, world" is not split into two columns while parsing. Set to `"off"` to ignore all quotes. |
`trim` | `true` | Indicates whether the parser trims spaces surrounding column content, e.g. " content " is trimmed to "content". |
`ignoreEmpty` | `false` | Ignores empty values in CSV columns. Set this to `true` to skip columns whose value is not given. |
`includeColumns` | | Instructs the parser to include only the columns specified by the regular expression. Example: `/(name\|age)/` parses and includes columns whose header contains "name" or "age". |
`ignoreColumns` | | Instructs the parser to ignore the columns specified by the regular expression. Example: `/(name\|age)/` ignores columns whose header contains "name" or "age". |

Parameters for `readXlsx`:

Parameter | Example | Description |
---|---|---|
`header` | `{rows: 1}` | The number of rows, counted from top to bottom, that are skipped and not present in the resulting object. |
`sheets` | `['sheet2']` | Only get the data from a specific sheet. |
`columnToKey` | static example: `{ A: 'id', B: 'firstName' }`, dynamic example: `{ '*': '{{columnHeader}}' }` | Names the columns in the output. It is possible to use a value from the sheet with e.g. `'{{A1}}'` or `{{columnHeader}}`, which follows the `header` parameter settings. To dynamically name every column, use `'*'`. Omitted columns are ignored in the output, even if they are specified in the `range` parameter. |
`range` | `'A2:B3'` | Defines the range from which to include data. If your column range goes into double letters (e.g. AG), set the range from A to Z, because the submodule currently works with alphabetical sorting internally. |
`sheetStubs` | `false` | To include empty cells (NULL values), set this to `true`. |

Parameters for the textract-based functions (`readDocx`, `readPptx`, `readHtml`, `readTxt`, `readMd`, `readPdf`, `readXml`):

Parameter | Default | Description |
---|---|---|
`preserveLineBreaks` | `false` | Set this to `true` and textract will not strip any line breaks. Line breaks are preserved most of the time even with `false`, but to be sure they are preserved, set this to `true`. |
`preserveOnlyMultipleLineBreaks` | `false` | Some extractors, like PDF, insert line breaks at the end of every line, even in the middle of a sentence. If this option is set to `true`, any single line break is removed but multiple line breaks are preserved. Check your output with this option, though, as it does not preserve paragraphs unless there are multiple breaks. |
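The `includeColumns` and `ignoreColumns` parameters of `readCsv` select columns by matching a regular expression against the header. The following sketch mimics that selection on a single parsed row. It is written with Python's `re` module for illustration only, and the function name `select_columns` and the sample row are hypothetical.

```python
import re


def select_columns(record, include=None, ignore=None):
    """Keep columns whose header matches `include` and does not match `ignore`."""
    selected = {}
    for header, value in record.items():
        if include is not None and not re.search(include, header):
            continue  # header does not contain an included pattern
        if ignore is not None and re.search(ignore, header):
            continue  # header contains an ignored pattern
        selected[header] = value
    return selected


row = {"name": "Ada", "age": "36", "city": "London", "username": "ada36"}
included = select_columns(row, include=r"(name|age)")
remaining = select_columns(row, ignore=r"(name|age)")
print(included)
print(remaining)
```

Note that `username` is matched as well, because the pattern only needs to be contained in the header.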