The Text Searcher module searches the content of text files, MS-Word and MS-Excel documents, text-based PDF documents, and OCR output files. Searching is therefore not confined to the indexing data; it also covers the document's content. Users can find documents that contain specific words or phrases. The search syntax is similar to the one Google uses for web searches. Text searches can be combined with parameters taken from the system's database: for instance, a search for all documents related to a certain customer that contain the word “Order”.
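As an illustration of combining indexing data with a content search, the following is a minimal sketch in plain Python. It is not the module's implementation: the sample records, the `customer` field, and the helper names `tokenize`, `build_index` and `search` are all hypothetical, and a real deployment would delegate the text part to the search engine.

```python
import re
from collections import defaultdict

# Hypothetical sample records: indexing data (customer) plus extracted text.
DOCUMENTS = [
    {"id": 1, "customer": "Acme", "text": "Purchase order 1001 for office chairs"},
    {"id": 2, "customer": "Acme", "text": "Invoice 2002 for delivered chairs"},
    {"id": 3, "customer": "Globex", "text": "Order confirmation for 40 desks"},
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Build an inverted index: term -> set of document ids containing it."""
    index = defaultdict(set)
    for doc in documents:
        for term in tokenize(doc["text"]):
            index[term].add(doc["id"])
    return index

def search(documents, index, word, **metadata):
    """Return ids of documents that contain `word` and match every metadata filter."""
    text_hits = index.get(word.lower(), set())
    return sorted(
        doc["id"]
        for doc in documents
        if doc["id"] in text_hits
        and all(doc.get(key) == value for key, value in metadata.items())
    )

INDEX = build_index(DOCUMENTS)
```

Under these assumptions, `search(DOCUMENTS, INDEX, "Order", customer="Acme")` returns only the Acme document whose text contains the word, while omitting the `customer` filter returns every document containing it.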
The module employs Apache Lucene™ Core, a high-performance, full-featured text search engine library written entirely in Java.
Scalable, High-Performance Indexing:
- over 150GB/hour on modern hardware
- small RAM requirements — only 1MB heap
- incremental indexing as fast as batch indexing
- index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms:
- ranked searching — best results returned first
- many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
- fielded searching (e.g. title, author, contents)
- sorting by any field
- multiple-index searching with merged results
- allows simultaneous update and searching
- flexible faceting, highlighting, joins and result grouping
- fast, memory-efficient and typo-tolerant suggesters
- pluggable ranking models, including the Vector Space Model and Okapi BM25
- configurable storage engine (codecs)
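To make the ranked-searching idea concrete, here is a small sketch of the textbook Okapi BM25 formula in plain Python. It is an illustration only and does not reproduce Lucene's internal scoring code; the function name `bm25_scores` is hypothetical, the documents are assumed to be non-empty lists of already tokenized terms, and the parameter values `k1=1.2` and `b=0.75` are the commonly used defaults.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query with Okapi BM25.

    Assumes `docs` is a non-empty list of non-empty token lists.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency inside this document
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
            score += idf * norm
        scores.append(score)
    return scores
```

Sorting documents by descending score gives the "best results returned first" behavior; documents that contain none of the query terms score zero.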
The ADA System Bar Code Module reads over 30 barcode types from numerous image file formats with high speed and accuracy, and so automatically obtains the relevant indexing data for a document or a group of documents. The module recognizes a barcode in any position on the page, which also allows it to recognize several barcodes on the same page. In addition, a barcode can act as a separator between multi-page documents scanned in one batch.
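The separator behavior can be sketched as follows. This is a hedged illustration, not the module's code: it assumes each scanned page has already been decoded into a hypothetical dictionary with a `barcodes` list, that the separator sheet carries the made-up value `"SEP"`, and that separator sheets are dropped rather than kept in the output.

```python
def split_batch(pages, is_separator):
    """Split a scanned batch into documents at every separator page.

    Separator pages themselves are discarded (an assumption of this sketch).
    """
    documents, current = [], []
    for page in pages:
        if is_separator(page):
            # Close the document collected so far, if any.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents

def has_separator_barcode(page, separator_value="SEP"):
    """Hypothetical test: the page carries the separator barcode value."""
    return separator_value in page["barcodes"]

# Hypothetical batch: five scanned pages, two of them separator sheets.
BATCH = [
    {"number": 1, "barcodes": ["SEP"]},
    {"number": 2, "barcodes": ["INV-1001"]},
    {"number": 3, "barcodes": []},
    {"number": 4, "barcodes": ["SEP"]},
    {"number": 5, "barcodes": ["INV-1002"]},
]

DOCS = split_batch(BATCH, has_separator_barcode)
```

With this batch, pages 2 and 3 form the first document and page 5 the second. The other barcode values found on a document's pages (here `INV-1001` and `INV-1002`) could then serve as its indexing data.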
Features:
– Report confidence values for detected barcodes.
– Receive more accurate decoding of barcodes.
– Eliminate false positives when reading patch codes to minimize size.
– Identify and recognize barcodes anywhere on the page, in any orientation, in milliseconds.
– Automatically handle broken, damaged, and poorly printed or scanned barcodes.
Supported image color depths:
24-bit color images, 8-bit grayscale images, 1-bit black and white images
Supported barcode formats:
Add-2, Add-5, Airline 2 of 5, Australia Post 4-State Code, BCD Matrix, Codabar, Code 128 (A,B,C), Code 2 of 5, Code 32, Code 39, Code 39 Extended, Code 93, Code 93 Extended, DataLogic 2 of 5, EAN 128 (GS1, UCC), EAN-13, EAN-8, GS1 DataBar, Industrial 2 of 5, Intelligent Mail (OneCode), Interleaved 2 of 5, Invert 2 of 5, ITF-14 / SCC-14, Matrix 2 of 5, Patch Codes, PostNet, Royal Mail (RM4SCC), UCC 128, UPC-A, UPC-E
The ADA System OCR Module recognizes text in scanned documents. After recognition, the Text Searcher Module can be used to search for documents by their content. In addition, predefined areas of scanned forms can be detected in order to extract data from them.
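Extracting data from a predefined area of a form can be pictured with the sketch below. It assumes, hypothetically, that recognition yields words with pixel bounding boxes; the `Word` class, the `extract_zone` helper, and the zone coordinates are illustrative and are not the module's API.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """A recognized word with its bounding box in pixels (hypothetical shape)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

def extract_zone(words, zone):
    """Join the words whose centre point lies inside zone = (left, top, right, bottom)."""
    left, top, right, bottom = zone
    inside = [
        w for w in words
        if left <= w.x + w.width / 2 <= right
        and top <= w.y + w.height / 2 <= bottom
    ]
    # Simplified reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in inside)

# Hypothetical recognition result for the top of an invoice form.
WORDS = [
    Word("Invoice", 40, 30, 90, 20),
    Word("No.", 140, 30, 40, 20),
    Word("INV-2041", 400, 30, 110, 20),
    Word("Customer", 40, 80, 100, 20),
    Word("Acme", 400, 80, 60, 20),
]

# Hypothetical zone covering the field where the invoice number is printed.
INVOICE_NUMBER_ZONE = (380, 20, 600, 60)
```

Here `extract_zone(WORDS, INVOICE_NUMBER_ZONE)` yields the text printed in the invoice-number field. Real forms usually need tolerance for skew and for words that straddle the zone border, which this sketch leaves out.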
Speed and Reliability
– Large volume document batch processing
– Single and multi-page documents
Accuracy
– Automatically detect, segment, and recognize multiple languages on the same document
– Detects font characteristics (font-family name, style, size, bold, italic, underline, strikeout, slope angle, etc.)
– Spell checking dictionary support
– Full-page analysis and Zonal recognition
Automatic document cleanup
– Omni-directional noise removal
– Undither text
– Dot matrix correction
Automatic document preprocessing
– Deskew of scanned document
– Detect and correct the orientation of the page (flipped or reversed)
– Remove borders
– Split pages
Supports more than 40 languages, including:
English (en), Afrikaans (af), Albanian (sq), Arabic (ar), Azerbaijani (az), Basque (eu), Belarusian (be), Bulgarian (bg), Catalan (ca), Chinese Simplified (zh-Hans), Chinese Traditional (zh-Hant), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), Estonian (et), Faroese (fo), Finnish (fi), French (fr), Galician (gl), German (de), Greek (el), Hungarian (hu), Icelandic (is), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Latvian (lv), Lithuanian (lt), Macedonian (mk), Malay (ms), Maltese (mt), Norwegian (no), Polish (pl), Portuguese (pt), Portuguese Brazil (pt-BR), Romanian (ro), Russian (ru), Serbian (sr), Serbian Cyrillic (sr-Cyrl-CS), Slovak (sk), Slovenian (sl), Spanish (es), Swahili (sw), Swedish (sv), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Vietnamese (vi)