
Balabolka Text Extract Utility

The utility extracts text from various types of files. The extracted text can be combined into one file and/or split into several files. The pronunciation correction rules from Balabolka can be applied to the text.

Supported formats for input files: AZW, AZW3, CHM, DOC, DOCX, EPUB, FB2, HTML, MHT, MOBI, ODT, PDB, PDF, PRC, RTF, TCR, TXT, WPD. The IFilter interface will be used for files with unknown extensions.

The utility works from the command line and displays no user interface. This makes it easy to integrate its text processing options into other applications, for example.

Execution order of operations:

  1. Extract text from input file(s).
  2. Format the text: remove spaces, line breaks, etc. (if the options are specified).
  3. Combine the files into one file (if the option is specified).
  4. Split the text (if the options are specified).
  5. Apply the rules for pronunciation correction (if the option is specified).
  6. Save output file(s).
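
The order matters when options are combined: splitting (step 4) runs before pronunciation correction (step 5), so split keywords are matched against the text before the dictionary rules change it. The Python sketch below is only a toy model of that ordering, not the utility's implementation. All function names are hypothetical, and the formatting and rule steps are simplified stand-ins.

```python
import re

def format_text(text):
    # Stand-in for step 2: collapse runs of spaces (rough analogue of --remove-spaces).
    return re.sub(r" {2,}", " ", text)

def split_on_keyword(text, keyword):
    # Stand-in for step 4: start a new part at each keyword, keeping the keyword.
    parts = re.split(f"(?={re.escape(keyword)})", text)
    return [p for p in parts if p.strip()]

def apply_rules(text, rules):
    # Stand-in for step 5: plain substring replacement (real rules are richer).
    for old, new in rules.items():
        text = text.replace(old, new)
    return text

def pipeline(texts, keyword, rules):
    formatted = [format_text(t) for t in texts]      # step 2
    combined = "\n".join(formatted)                  # step 3
    parts = split_on_keyword(combined, keyword)      # step 4
    return [apply_rules(p, rules) for p in parts]    # step 5

parts = pipeline(
    ["CHAPTER 1\nHello  world", "CHAPTER 2\nBye"],
    keyword="CHAPTER",
    rules={"CHAPTER": "Chapter"},
)
print(parts)
```

In this model the keyword "CHAPTER" still splits the text even though a rule later rewrites it to "Chapter". Had the rule been applied first, the keyword would no longer match.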

Licence: Freeware
 
Command Line

The utility accepts command line parameters that control how text is extracted from files. The command line uses the syntax "blb2txt [options ...]"; all parameters must be separated by spaces. Options can appear in any order on the command line as long as each is paired with its related parameter. Run "blb2txt -?" to get help on the command line syntax and parameters.


-f file_mask
Sets the name of the input file or the mask for a group of input files. The -f option may be specified several times on the command line.
-v folder_name
Sets the name of the output folder where text files are saved.
-p filename
Sets the pattern for the output file name (for example, "Text Document"). If absent, the input file name will be used. Use the %FirstLine% variable to insert the first line of text into the output file name.
-i
Reads data from STDIN. The file format is auto-detected from the data. If this option is specified, the -f option is ignored.
-o
Writes text to STDOUT. If this option is specified, the -v and -p options are ignored.
-u
Combines text files into one output file.
-b
Adds a sequence number before the output file name.
-a
Adds a sequence number after the output file name.
-n integer
Sets the starting sequence number for output files. The default is 1.
-e encoding
Sets the encoding for output files ("ansi", "utf8" or "unicode"). The default is "ansi".
-t integer
Splits text into output files of the target size (in kilobytes).
-k keyword
Splits text at a special keyword in the input file. The option is case-sensitive. The -k option may be specified several times on the command line.
-r keyword
Splits text at a keyword and removes the keyword from the output files. The option is case-sensitive. The -r option may be specified several times on the command line.
-w
Splits text by two empty lines in succession.
-l
Splits text at lines in which all letters are uppercase.
-d file_name
Uses a dictionary for pronunciation correction (*.REX or *.DIC). The -d option may be specified several times on the command line.
-if
Uses the IFilter interface to extract text. If this fails, the application falls back to its default method.
-pwd text
Sets the password for encrypted PDF files.
-? or -h
Prints the list of available command line options.
--remove-spaces
Removes excess spaces (two or more blank spaces in succession, no-break spaces).
--remove-hyphens
Removes hyphens at the ends of lines in the text.
--remove-linebreaks
Removes linebreaks inside paragraphs.
--remove-empty-lines
Removes empty lines.
--replace-empty-lines
Replaces several consecutive empty lines with one empty line.
--remove-square-brackets
Removes text in [square brackets].
--remove-curly-brackets
Removes text in {curly brackets}.
--remove-angle-brackets
Removes text in <angle brackets>.
--fix-ocr-errors
Fixes OCR errors (for languages with Cyrillic alphabets only).
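
When the utility is called from another program, it can help to assemble the options as a list rather than as one long string, so that repeated options and values containing spaces are handled consistently. The Python sketch below shows one possible way to do that. The helper name build_args and its parameters are hypothetical; only the option letters come from the list above. Nothing is executed here.

```python
def build_args(masks, out_dir, pattern=None, encoding=None,
               split_keywords=(), flags=()):
    """Return an argument list suitable for subprocess.run()."""
    args = ["blb2txt"]
    for mask in masks:                 # -f may be given several times
        args += ["-f", mask]
    args += ["-v", out_dir]
    if pattern is not None:
        args += ["-p", pattern]
    if encoding is not None:
        # The documented values for -e are "ansi", "utf8" and "unicode".
        if encoding not in ("ansi", "utf8", "unicode"):
            raise ValueError(f"unsupported encoding: {encoding}")
        args += ["-e", encoding]
    for keyword in split_keywords:     # -k may be given several times
        args += ["-k", keyword]
    args += list(flags)                # e.g. "--remove-spaces"
    return args

args = build_args(
    [r"d:\Docs\*.doc", r"d:\Docs\*.rtf"],
    r"d:\Text",
    encoding="utf8",
    flags=["--replace-empty-lines"],
)
print(args)
```

The resulting list can be passed to subprocess.run(args) on a system where blb2txt is installed and on the PATH.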



Examples

Extract text from BOOK.DOC and save as "New Book.txt":

blb2txt -f "d:\Docs\book.doc" -v "d:\Text\" -p "New Book"



Extract text from Microsoft Word and RTF documents, replace runs of empty lines with a single empty line and save the text files in UTF-8 encoding:

blb2txt -f "d:\Docs\*.doc" -f "d:\Docs\*.rtf" -v "d:\Text\" -e "utf8" --replace-empty-lines



Extract text from all files in the specified folder, combine it and save it as "Document.txt":

blb2txt -f "d:\Docs\*.*" -v "d:\Text\" -p "Document" -u



Extract text from 1.DOC, split it into parts of 100 KB each and save them as text files "Document 20.txt", "Document 21.txt", etc.:

blb2txt -f "d:\Docs\1.doc" -v "d:\Text\" -p "Document" -a -n 20 -t 100



Extract text from BOOK.FB2, split the text at the words "CHAPTER" and "CONTENTS" and save the parts as files named "Book 1.txt", "Book 2.txt", etc.:

blb2txt -f "d:\Book\book.fb2" -v "d:\Text\" -p "Book" -k "CHAPTER" -k "CONTENTS"
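
A calling program usually needs to read the split parts back in order. A plain alphabetical sort would put "Book 10.txt" before "Book 2.txt", so the parts should be sorted by their number. The Python sketch below assumes the naming shown in the example above ("<pattern> <number>.txt"); the helper name numbered_parts is hypothetical, and the demonstration uses placeholder files instead of real output.

```python
import re
import tempfile
from pathlib import Path

def numbered_parts(folder, pattern):
    """Return files named '<pattern> <number>.txt' in folder, sorted by number."""
    name_re = re.compile(rf"^{re.escape(pattern)} (\d+)\.txt$", re.IGNORECASE)
    found = []
    for path in Path(folder).iterdir():
        match = name_re.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort(key=lambda item: item[0])   # numeric, not alphabetical, order
    return [path for _, path in found]

# Demonstration with placeholder files standing in for the utility's output.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 10):
        Path(tmp, f"Book {n}.txt").write_text("part", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("unrelated", encoding="utf-8")
    names = [p.name for p in numbered_parts(tmp, "Book")]

print(names)
```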



Extract text from BOOK.EPUB, split the text at each "###", remove "###" from the text and save each part as a new file:

blb2txt -f "d:\Book\book.epub" -v "d:\Text\" -p "Book" -r "###"



Read text from STDIN, remove excess spaces, line breaks inside paragraphs and repeated empty lines, and write the updated text to STDOUT:

blb2txt -i -o --remove-spaces --remove-linebreaks --replace-empty-lines
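
The STDIN/STDOUT mode is convenient for calling the utility from another program without temporary files. The Python sketch below pipes the raw bytes of a document to the command shown above and returns whatever is written to STDOUT. It assumes that blb2txt is installed and on the PATH. The function name extract_text is hypothetical. The result is returned as bytes because this page does not state which encoding is used for STDOUT, and exit codes are not checked because they are not documented here.

```python
import subprocess

# The same options as in the example above.
CMD = ["blb2txt", "-i", "-o",
       "--remove-spaces", "--remove-linebreaks", "--replace-empty-lines"]

def extract_text(data: bytes, timeout: float = 60.0) -> bytes:
    """Send file contents to blb2txt via STDIN and return its STDOUT bytes."""
    result = subprocess.run(
        CMD,
        input=data,            # written to the child's STDIN
        capture_output=True,   # collect STDOUT and STDERR
        timeout=timeout,
    )
    return result.stdout

# Usage (requires blb2txt to be available):
# with open("book.epub", "rb") as fh:
#     raw = extract_text(fh.read())
```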



Configuration File

The command line options can be stored in a configuration file named "blb2txt.cfg" in the same folder as the utility.

A sample configuration file:

-f d:\Docs\*.rtf
-f d:\Books\*.epub
-f d:\Books\*.fb2
-v d:\Text
-b
-n 1
-t 25
-e utf8
-d d:\rex\rules.rex
-d d:\dic\rules.dic
--remove-spaces
--remove-linebreaks
--replace-empty-lines

The utility can combine options from the configuration file with options from the command line.
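
A configuration file in this layout can also be generated by a script. The Python sketch below writes one option per line, following the sample above. The helper name write_config is hypothetical. The sketch writes the values unquoted, as in the sample, and assumes plain ASCII content, because this page does not describe how quoting or non-ASCII characters are handled in the file.

```python
import tempfile
from pathlib import Path

def write_config(folder, options):
    """Write (name, value) pairs, one option per line, to blb2txt.cfg."""
    lines = []
    for name, value in options:
        # Options without a parameter (such as -b) are passed with value None.
        lines.append(name if value is None else f"{name} {value}")
    path = Path(folder) / "blb2txt.cfg"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path

# Demonstration in a temporary folder; in practice the file would be placed
# in the same folder as the utility.
with tempfile.TemporaryDirectory() as tmp:
    cfg = write_config(tmp, [
        ("-f", r"d:\Docs\*.rtf"),
        ("-v", r"d:\Text"),
        ("-b", None),
        ("-e", "utf8"),
        ("--remove-spaces", None),
    ])
    cfg_text = cfg.read_text(encoding="utf-8")

print(cfg_text)
```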




Licence

You are free to use and distribute the software for noncommercial purposes. For commercial use or distribution, you need to get permission from the copyright holder.