How to use ChupaText as a command line tool
You can extract text and meta-data from an input with the chupa-text command. chupa-text prints the extracted text and meta-data as JSON.
Input
The chupa-text command accepts a local file path or a URI.
Here is a local file path example:
% chupa-text hello.txt.gz
Here is a URI example:
% chupa-text https://github.com/ranguba/chupa-text/raw/master/test/fixture/gzip/hello.txt.gz
Output
The chupa-text command prints the extracted result as JSON:
% chupa-text hello.txt.gz
{
  "mime-type": "application/x-gzip",
  "uri": "hello.txt.gz",
  "size": 36,
  "texts": [
    {
      "mime-type": "text/plain",
      "uri": "hello.txt",
      "size": 6,
      "body": "Hello\n"
    }
  ]
}
The JSON uses the following data structure:
{
  "mime-type": "<MIME type of the input>",
  "uri": "<URI or path of the input>",
  "size": <Byte size of the input data>,
  "other-meta-data1": <Other meta-data value1>,
  "other-meta-data2": <Other meta-data value2>,
  "...": <...>,
  "texts": [
    {
      "mime-type": "<MIME type of the extracted data1>",
      "uri": "<URI or path of the extracted data1>",
      "size": <Byte size of the text of the extracted data1>,
      "body": "<The text of the extracted data1>",
      "other-meta-data1": <Other meta-data value1 of the extracted data1>,
      "other-meta-data2": <Other meta-data value2 of the extracted data1>,
      "...": <...>
    },
    {
      <The information of the extracted data2>
    },
    {
      <The information of the extracted data3>
    },
    <...>
  ]
}
You can find the extracted texts in texts[0].body, texts[1].body and so on. One input may yield more than one text because ChupaText supports archive files such as tar.
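As a rough illustration of how a script might consume this output, the following Ruby sketch parses the sample JSON shown earlier and collects every body. The helper name extract_bodies is hypothetical and not part of ChupaText; only Ruby's standard json library is used.

```ruby
require "json"

# Sample output copied from the hello.txt.gz example; in real use this
# string would come from the standard output of the chupa-text command.
SAMPLE_OUTPUT = <<~'JSON'
  {
    "mime-type": "application/x-gzip",
    "uri": "hello.txt.gz",
    "size": 36,
    "texts": [
      {
        "mime-type": "text/plain",
        "uri": "hello.txt",
        "size": 6,
        "body": "Hello\n"
      }
    ]
  }
JSON

# Hypothetical helper (not part of ChupaText): returns the body of every
# extracted text, in the order the entries appear in the "texts" array.
def extract_bodies(json_text)
  result = JSON.parse(json_text)
  result.fetch("texts", []).map { |text| text["body"] }
end

extract_bodies(SAMPLE_OUTPUT).each_with_index do |body, i|
  puts "texts[#{i}].body = #{body.inspect}"
end
```

In practice the JSON string could be captured from the command itself, for example with Open3.capture2 from Ruby's standard library, instead of being embedded in the script.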
Command line options
You can customize the behavior of the chupa-text command. Here are the command line options:
--configuration=FILE
It reads the configuration from FILE. See the next section for configuration file details.
ChupaText provides a default configuration file with suitable settings, so you normally don't need a custom configuration file.
--help
It shows available command line options and exits.
Configuration
A ChupaText configuration file is a Ruby script, but it is easy to read and write even for users who don't know Ruby.
The basic syntax is the following:
category.name = value
Here is an example that sets ["tar", "gzip"] as the value of the names variable in the decomposer category:
decomposer.names = ["tar", "gzip"]
Here are configuration parameters:
decomposer.names = ["<decomposer name1>", "<decomposer name2>", "..."]
It specifies an array of decomposer names to be used by the chupa-text command. You can use a glob pattern for a decomposer name, such as "*zip". "*zip" matches "zip", "gzip" and so on.
The default is ["*"]. It means that all installed decomposers are used.
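To make the glob matching concrete, here is a minimal Ruby sketch. It assumes the patterns behave like Ruby's File.fnmatch, which is an assumption about ChupaText's internals rather than something stated here; the helper select_decomposers and the list of installed names are hypothetical.

```ruby
# Hypothetical helper (not part of ChupaText): keeps the names that match
# at least one of the glob patterns, using File.fnmatch for the matching.
def select_decomposers(patterns, installed_names)
  installed_names.select do |name|
    patterns.any? { |pattern| File.fnmatch(pattern, name) }
  end
end

# Illustrative list; the real set depends on which decomposers are installed.
INSTALLED = ["tar", "gzip", "zip"]

p select_decomposers(["*zip"], INSTALLED)        # "*zip" matches "gzip" and "zip"
p select_decomposers(["*"], INSTALLED)           # the default pattern matches every name
p select_decomposers(["tar", "gzip"], INSTALLED) # plain names match only themselves
```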
mime_types["<extension>"] = "<MIME type>"
It specifies a mapping from a path extension to a MIME type.
Here is an example that maps "html" to "text/html":
mime_types["html"] = "text/html"
The default configuration file registers popular MIME types.
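The idea of an extension-to-MIME-type table can be sketched in plain Ruby as follows. This only illustrates the mapping; the MIME_TYPES constant stands in for the configuration's mime_types, and the helper guess_mime_type is hypothetical. How ChupaText itself performs the lookup is not described here.

```ruby
# Stand-in for the mime_types table of the configuration; a plain Hash here.
MIME_TYPES = {}
MIME_TYPES["html"] = "text/html"
MIME_TYPES["txt"] = "text/plain"

# Hypothetical helper (not part of ChupaText): looks up a MIME type by the
# last extension of a path and returns nil when the extension is unknown.
def guess_mime_type(mime_types, path)
  extension = File.extname(path).delete_prefix(".")
  mime_types[extension]
end

p guess_mime_type(MIME_TYPES, "index.html")
p guess_mime_type(MIME_TYPES, "archive.unknown")
```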