Balabolka :: Utility for extracting text from files

The utility extracts text from files of various formats. The extracted text can be combined into a single file and/or split into several files. The rules from the pronunciation-correction dictionaries of Balabolka can also be applied to the text.

The following file formats are supported: AZW, AZW3, CHM, DjVu, DOC, DOCX, EML, EPUB, FB2, FB3, HTML, LIT, MD, MHT, MOBI, ODP, ODS, ODT, PDB, PDF, PPT, PPTX, PRC, RTF, TCR, TXT, TXTZ, WPD, WRI, XLS, XLSX.

The application has no graphical interface and runs in text mode. Its operating mode can be configured through the command line or a configuration file.

The utility performs operations in the following order:

Extract text from the file.
Format the text: remove extra spaces, line breaks, etc. (if such options are specified).
Combine the texts into a single file (if such an option is specified).
Split the text into parts (if such options are specified).
Apply the pronunciation-correction rules (if such options are specified).
Save the file or files to disk.

License: Freeware

Command line

The utility can be configured through the command line. Options are separated by spaces and begin with "-" (hyphen). To get the full list of options, run blb2txt.exe with the -? or -h option.

-f file_name

Sets the name of the input file, or a mask for the names of the input files, from which text is extracted. The command line may contain several -f options.

-fl file_name

Sets the name of the text file with the list of input files (one file name per line).

-v folder_name

Sets the name of the folder for saving the file with the extracted text.

-p text

Sets the name pattern for the file with the extracted text (for example, "Text document"). If the option is not specified, the name of the source file is used.

Use the %FileName% variable to insert the input file name into the output file name.
Use the %FirstLine% variable to insert the first line of text.
Use the %Header% variable to insert the chapter title.
Use the %Number% variable to change the position of the sequence number inside the output file name.

Warning! It is necessary to double a percent sign (%) in a batch script. For example: -p %%Number%%
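
When batch scripts are generated by another program, the doubling rule above can be applied automatically. A minimal sketch in Python; the helper name escape_for_batch is hypothetical and not part of the utility:

```python
def escape_for_batch(value):
    """Double every percent sign so a .bat/.cmd script passes it through literally."""
    return value.replace("%", "%%")


# Pattern as typed at an interactive prompt vs. as written inside a batch file.
pattern = "%Number% - %Header%"
batch_line = 'blb2txt -f "d:\\Book\\book.fb2" -v "d:\\Text\\" -p "{}" -c'.format(
    escape_for_batch(pattern)
)
```

Here batch_line contains -p "%%Number%% - %%Header%%", which is the form a batch script needs.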

-ext text

Set the extension for output filenames. The default is "txt".

-out file_name

Sets the full name of the output file. It is recommended to specify this option only when the utility is used as part of other software. If the utility is used for custom document import, the external program runs the utility from a command line and passes the full name of the text file to create.

-s

Search input files in subfolders.

-cf

Create a subfolder for each input file. A file name will be used as a name of an output subfolder.

-i

Reads the input text from standard input (STDIN). If the option is specified, the option -f is ignored.

-o

Writes the text to standard output (STDOUT). If the option is specified, the options -v and -p are ignored.
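
The -i and -o options let another program use the utility as a text filter. The Python sketch below shows one way this might look. It assumes that blb2txt is on the PATH and that the pipes carry UTF-8 text; this page does not state which encoding the utility expects on STDIN/STDOUT, so the encoding parameter is an assumption to adjust. The function names are hypothetical.

```python
import subprocess


def stdin_stdout_args(extra_options=()):
    """Build an argument list that reads from STDIN (-i) and writes to STDOUT (-o)."""
    return ["blb2txt", "-i", "-o", *extra_options]


def filter_text(text, extra_options=("--remove-spaces",), encoding="utf-8"):
    """Pipe text through the utility and return what it prints.

    Assumptions: blb2txt is on PATH; the pipe encoding matches `encoding`.
    """
    result = subprocess.run(
        stdin_stdout_args(extra_options),
        input=text.encode(encoding),
        capture_output=True,
        check=True,  # raise if the process reports a non-zero exit code
    )
    return result.stdout.decode(encoding)
```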

-u

Combines the texts from multiple files into a single file.

-b

Adds the sequence number before the file name.

-a

Adds the sequence number after the file name.

-n number

Sets the starting sequence number for output files. The default is 1.

-e encoding

Sets the encoding of the file with the extracted text ("ansi", "utf8" or "unicode"). The default is "ansi".

-t number

Sets the text-splitting mode: split by a specified file size. The number is the size of each part in characters.
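
To illustrate what splitting by a character count means, here is a naive Python sketch. It cuts exactly every N characters; the utility may well choose boundaries more carefully (for example at the end of a sentence or paragraph), which this page does not describe, so treat the function as an illustration of the idea only.

```python
def split_by_size(text, size):
    """Cut text into consecutive parts of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be a positive number of characters")
    return [text[i:i + size] for i in range(0, len(text), size)]


parts = split_by_size("abcdefghij", 4)
```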

-k keyword

Sets the text-splitting mode: split at keywords found in the source text. The option is case-sensitive. The command line may contain several -k options.

-r keyword

Splits the text at the keyword and removes the keyword from the text. The option is case-sensitive. The command line may contain several -r options.
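
The difference between -k and -r can be sketched in Python. The code assumes that with -k each new part begins at the keyword (so the keyword stays in the text), while with -r the keyword itself is dropped; the exact placement of the boundary is an assumption, and both function names are hypothetical.

```python
def split_keep(text, keyword):
    """Split before each keyword occurrence; the keyword stays in the text (like -k)."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    parts, start = [], 0
    pos = text.find(keyword, 1)  # a keyword at position 0 starts the first part
    while pos != -1:
        parts.append(text[start:pos])
        start = pos
        pos = text.find(keyword, pos + len(keyword))
    parts.append(text[start:])
    return parts


def split_remove(text, keyword):
    """Split at each keyword occurrence and drop the keyword (like -r)."""
    return [part for part in text.split(keyword) if part]
```

Both functions use plain substring search, so they are case-sensitive, as the options are.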

-w

Sets the text-splitting mode: split where two blank lines occur one after another.

-l

Sets the text-splitting mode: split at text in which all letters are uppercase.

-c

Splits text by a table of contents. The application extracts positions of chapter beginnings from the input file (if the file contains such information).

-toc

Generates a table of contents and splits text. The application splits the extracted text by keywords (like "chapter" or "volume"). If the option is used together with the option -c, the application will try to extract a table of contents from the document; if it fails, a new table of contents will be generated.

-m number

Sets the minimal size of text parts for splitting (as a number of characters).

-j number

Ignores the chapter beginning if the size of the previous chapter is less than the specified value (in characters). The option is used together with -c or -toc.
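
One possible reading of the -j rule, sketched in Python: walk through the chapter start positions and skip a start whenever the chapter before it would be shorter than the threshold. Measuring a chapter from the last kept start is an assumption about how the utility counts, and the function name is hypothetical.

```python
def filter_chapter_starts(starts, min_size):
    """Drop chapter beginnings that would leave the previous chapter too short.

    `starts` are character offsets of chapter beginnings, in ascending order.
    """
    kept = []
    for pos in starts:
        if kept and pos - kept[-1] < min_size:
            continue  # previous chapter is too short: ignore this beginning
        kept.append(pos)
    return kept


kept_starts = filter_chapter_starts([0, 500, 2000, 2100, 5000], 1024)
```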

-hh text

Inserts text in front of headings (for example: ## Chapter 1).

-d file_name

Uses a pronunciation-correction dictionary (a file with the extension *.BXD, *.REX or *.DIC). The command line may contain several -d options.

-if

Uses IFilter interface to extract text. If this fails, the default method will be used by the application.

-g folder_name

Sets the name of the output folder for saving images from documents.

-cvr folder_name

Sets the name of the output folder for saving the book cover image.

-cft

Clones the Created/Modified/Accessed time of the input file into the output file. If the application combines text files or splits the extracted text, the option is ignored.

-x file_type

Sets the input file type. This makes it possible to specify the format of input documents with unknown file name extensions. For example: -x doc.

-pwd text

Sets the password for extracting text from a PDF file.

-dll file_name

Sets the path and name for 7z.dll (32bit). This library helps to extract text and images from documents inside archives (ZIP, RAR, etc.). 7z.dll is a part of 7-Zip software. If the option is not specified, the application and the library must be in the same folder; otherwise, the application will not be able to extract data from archive files.

-dex file_types

Sets the list of file types for extracting from archives. The option contains a comma-separated list of file types, for example: -dex "fb2,epub"
The command line may contain several -dex options. If the option is not specified, the application will extract text from all files in an archive. If it is necessary to extract text for all file types supported by the application, use the value "all-". For example: -dex all-

-dne file_types

Sets the list of file types to ignore when documents are extracted from archives. The option contains a comma-separated list of file types, for example: -dne "exe,dll"
The command line may contain several -dne options. If the option is not specified, the application will extract text from all files in an archive.

-dp

Display progress information in a console window.

-cfg file_name

Sets the name of the configuration file with the command line options (a text file where each line contains one option). If the option is not specified, the file blb2txt.cfg in the same folder as the utility will be used.

-h

Shows the description of the command line options.

--remove-spaces or -rs

Removes extra white space (two or more consecutive spaces, non-breaking spaces).

--remove-hyphens or -rh

Removes hyphens at the ends of lines.

--remove-linebreaks or -rl

Removes line breaks inside a paragraph.

--remove-empty-lines or -rm

Removes all empty lines.

--replace-empty-lines or -rp

Replaces multiple empty lines with a single empty line.
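
The effect of three of these clean-up options can be approximated with regular expressions. The Python functions below are rough equivalents written for illustration, not the utility's own rules; in particular, replacing a non-breaking space with an ordinary space and joining lines with a space are assumptions.

```python
import re


def remove_spaces(text):
    """Approximate --remove-spaces: normalize non-breaking spaces, collapse runs."""
    return re.sub(r" {2,}", " ", text.replace("\u00a0", " "))


def remove_linebreaks(text):
    """Approximate --remove-linebreaks: join single line breaks inside a paragraph."""
    # A lone "\n" (not next to another "\n") is treated as a break inside a paragraph.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


def replace_empty_lines(text):
    """Approximate --replace-empty-lines: keep at most one empty line in a row."""
    return re.sub(r"\n{3,}", "\n\n", text)
```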

--remove-square-brackets or -rsb

Removes text in [square brackets].

--remove-curly-brackets or -rcb

Removes text in {curly brackets}.

--remove-angle-brackets or -rab

Removes text in <angle brackets>.

--remove-round-brackets or -rrb

Removes text in (round brackets).
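
The four bracket options follow the same idea, which can be sketched in Python with one pattern per bracket type. The patterns only match bracket pairs with no further brackets of the same kind inside; how the utility treats nested brackets is not described here, so that behavior is left out. The names are hypothetical.

```python
import re

# One pattern per bracket type; each matches a pair with no nested pair inside.
BRACKET_PATTERNS = {
    "square": r"\[[^\[\]]*\]",
    "curly": r"\{[^{}]*\}",
    "angle": r"<[^<>]*>",
    "round": r"\([^()]*\)",
}


def remove_brackets(text, kind):
    """Remove bracketed text of the given kind, brackets included."""
    return re.sub(BRACKET_PATTERNS[kind], "", text)
```

Note that the text around a removed bracket is left untouched, so "See [1] here" becomes "See  here" with two spaces; combining the option with --remove-spaces would tidy that up.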

--remove-comments or -rc

Removes comments. Single-line comments start with // and continue until the end of the line. Multiline comments start with /* and end with */.
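
The comment syntax described above can be matched with two regular expressions. This Python sketch is an approximation for illustration: it does not know about string literals, so a // inside a URL such as http://example.com would also be treated as a comment start. Whether the utility has the same limitation is not stated.

```python
import re


def remove_comments(text):
    """Strip /* ... */ and // ... comments from text."""
    # Multiline comments first: /* ... */ may span several lines.
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    # Then single-line comments: from // to the end of the line.
    return re.sub(r"//[^\n]*", "", text)
```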

--remove-page-numbers or -rpn

Removes page numbers (it may be useful for DjVu/PDF files).

--fix-ocr-errors or -ocr

Fixes OCR errors (only for languages with a Cyrillic alphabet).

--fix-letter-spacing or -ls

Fixes letter-spacing in words (for example: s p a c e, _w_o_r_d).

--add-period or -ap

Adds a period if there is no punctuation after the last word of the paragraph.

--extract-summary number or -es number

Extracts a summary (also called "annotation") from FB2/FB3 files and inserts it at the beginning of the text. Possible values for the integer parameter:

0 - skips a summary (used by default);
1..5 - extracts a summary (a value determines the order in which an author name and a book title are listed).

--skip-notes or -sn

Skips notes, when the application extracts text from DOCX/FB2/FB3/MD/ODT files.

--include-notes number or -in number

Includes notes inside text, when the application extracts text from DOCX/FB2/FB3/MD/ODT files.
Possible values for the integer parameter:

0 - removes links to notes from text;
1 - keeps default positions of notes inside text (this value is used by default);
2 - places notes at the end of sentences;
3 - places notes at the end of paragraphs.

--insert-note-begin text or -inb text

Inserts words at the beginning of notes, when notes are included inside text (for example: Editor's note.).
The option is used for DOCX/FB2/FB3/MD/ODT files.

--insert-note-end text or -ine text

Inserts words at the end of notes, when notes are included inside text (for example: End of note.).
The option is used for DOCX/FB2/FB3/MD/ODT files.

--extract-tables number or -et number

Extract tables from DOCX/FB2/FB3/ODT files. Possible values for the integer parameter:

0 - skips tables;
1 - extract data from each cell as a new text line (this value is used by default);
2 - keep formatting when extracting a table.

--csv-comma

Columns are separated by a comma, when the application extracts data from XLS/XLSX/ODS files (default delimiter for CSV files).

--csv-semicolon

Columns are separated by a semicolon, when the application extracts data from XLS/XLSX/ODS files.

--csv-space

Columns are separated by a blank space, when the application extracts data from XLS/XLSX/ODS files.

--csv-tab

Columns are separated by a tab, when the application extracts data from XLS/XLSX/ODS files.

--csv-double-quote

Uses double-quote characters, if a field must be quoted (export from XLS/XLSX/ODS files).

--csv-single-quote

Uses single-quote characters, if a field must be quoted (export from XLS/XLSX/ODS files).
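
The delimiter and quote options correspond to two standard CSV parameters. The Python sketch below uses the standard csv module to show how the same rows look with different settings; it illustrates the CSV conventions only and makes no claim about the utility's exact quoting rules. The function name to_csv is hypothetical.

```python
import csv
import io


def to_csv(rows, delimiter=",", quotechar='"'):
    """Render rows as CSV text with the given delimiter and quote character."""
    buf = io.StringIO()
    writer = csv.writer(
        buf, delimiter=delimiter, quotechar=quotechar, lineterminator="\n"
    )
    writer.writerows(rows)
    return buf.getvalue()


# Roughly what --csv-semicolon with --csv-single-quote would ask for.
sample = to_csv([["a", "b c"], ["1", "x;y"]], delimiter=";", quotechar="'")
```

A field is quoted only when it needs to be, for example when it contains the delimiter itself.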

--eml-save folder_name

Extracts attachments from EML files and saves to a specified folder.

--eml-att

Extracts the list of attachments from EML files (names of files attached to the message).

--eml-cc

Extracts the header field "Cc" from EML files ("carbon copy"; it specifies additional recipients of the message).

--eml-date date_format

Extracts the header field "Date" from EML files (the local time and date when the message was composed and sent). The date format is defined by specifiers (such as "d", "m", "y", etc.). For example: "dd.mm.yyyy hh:nn:ss".
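
To preview what the example format would produce, the specifiers can be mapped to Python's strftime codes. The mapping below is an assumption: it treats "nn" as minutes and "hh" as a 24-hour value, which is how Delphi-style format strings usually behave, but this page does not list the full specifier set. Only the specifiers from the example are covered.

```python
from datetime import datetime

# Assumed meaning of the specifiers used in the example format string.
SPECIFIER_TO_STRFTIME = [
    ("yyyy", "%Y"),  # four-digit year
    ("dd", "%d"),    # day of month, two digits
    ("mm", "%m"),    # month, two digits
    ("hh", "%H"),    # hour, two digits (24-hour clock assumed)
    ("nn", "%M"),    # minutes, two digits
    ("ss", "%S"),    # seconds, two digits
]


def to_strftime(date_format):
    """Translate the example-style specifiers into a strftime pattern."""
    for specifier, code in SPECIFIER_TO_STRFTIME:
        date_format = date_format.replace(specifier, code)
    return date_format


preview = datetime(2024, 3, 5, 14, 7, 9).strftime(to_strftime("dd.mm.yyyy hh:nn:ss"))
```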

--eml-from

Extracts the header field "From" from EML files (the email address, and optionally the name of the author).

--eml-org

Extracts the header field "Organization" from EML files (the name of the organization through which the sender of the message has net access).

--eml-rt

Extracts the header field "Reply-To" from EML files (the address for replies to go to).

--eml-subj

Extracts the header field "Subject" from EML files (the subject of the message).

--eml-to

Extracts the header field "To" from EML files (the email address, and optionally the name of the message's recipient).

Command examples

Examples of commands that run the utility to extract text:

blb2txt -f "d:\Docs\book.doc" -v "d:\Text\"

blb2txt -f "d:\Docs\book.doc" -out "d:\Text\book.txt"

blb2txt -f "d:\Docs\*.doc" -f "d:\Docs\*.rtf" -v "d:\Text\" -e utf8 --replace-empty-lines

blb2txt -f "d:\Docs\*.*" -v "d:\Text\" -p "Documento" -u

blb2txt -f "d:\Docs\1.doc" -v "d:\Text\" -p "Documento" -a -n 20 -t 100000

blb2txt -f "d:\Book\book.fb2" -v "d:\Text\" -p "Livro" -k "CAPÍTULO" -k "Indice"

blb2txt -f "d:\Book\book.epub" -v "d:\Text\" -p "Livro" -r "###"

blb2txt -f "d:\Book\book.fb2" -v "d:\Text\" -p "%Number% - %Header%" -c -j 1024

blb2txt -i -o --remove-spaces --remove-linebreaks --replace-empty-lines

blb2txt -f "d:\Archive\*.zip" -v "d:\Text\" -dll "e:\7-Zip\7z.dll" -dex doc,docx
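
Commands like the ones above can also be assembled and started from a script. The following Python sketch builds the argument list for the third example; it assumes blb2txt is on the PATH and that a non-zero exit code signals failure (exit codes are not documented on this page). The function names are hypothetical.

```python
import subprocess


def build_args(masks, out_folder, encoding=None, options=()):
    """Assemble a blb2txt argument list from input masks and an output folder."""
    args = ["blb2txt"]
    for mask in masks:
        args += ["-f", mask]  # -f may be repeated, once per file or mask
    args += ["-v", out_folder]
    if encoding:
        args += ["-e", encoding]
    args += list(options)
    return args


def run_blb2txt(args):
    """Run the utility; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(args, check=True).returncode


example_args = build_args(
    ["d:\\Docs\\*.doc", "d:\\Docs\\*.rtf"],
    "d:\\Text",
    encoding="utf8",
    options=["--replace-empty-lines"],
)
```

Passing the arguments as a list avoids manual quoting of paths that contain spaces.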

Configuration file

The configuration file "blb2txt.cfg" can be saved in the same folder as the console application.

An example of the file contents:

-f d:\Docs\*.rtf
-f d:\Books\*.epub
-f d:\Books\*.fb2
-v d:\Text
-b
-n 1
-t 25000
-e utf8
-d d:\Dict\rules.bxd
--remove-spaces
--remove-linebreaks
--replace-empty-lines

The utility can combine options from the configuration file and from the command line.
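
A configuration file in this layout is easy to generate from a script. The Python sketch below writes one option per line; the UTF-8 encoding is an assumption, since the expected encoding of the file is not stated here, and the function name write_config is hypothetical.

```python
from pathlib import Path


def write_config(path, options):
    """Write a configuration file with one command line option per line."""
    # Encoding is an assumption; adjust it if paths contain non-ASCII characters.
    Path(path).write_text("\n".join(options) + "\n", encoding="utf-8")
    return path


CONFIG_OPTIONS = [
    "-f d:\\Docs\\*.rtf",
    "-v d:\\Text",
    "-e utf8",
    "--remove-spaces",
]
```

Calling write_config("blb2txt.cfg", CONFIG_OPTIONS) in the utility's folder would produce a file that is picked up by default; any other location can be passed with -cfg.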

License

You are free to use and distribute software for noncommercial purposes. For commercial use or distribution, you need to get permission from the copyright holder.