node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database
Use the Java Tika text extraction library on the .NET platform
Text extraction from multiple and large PDF documents.
Twitter text processing library (auto linking and extraction of usernames, lists and hashtags). Based on the Ruby and Java implementations by Matt Sanford
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
Read PDF files in JavaScript
C# and VB.NET samples for Docotic.Pdf library
R Interface to Apache Tika
Build search across multiple documents client-side in your file storage
Simple rule-based named entity recognition
An R package to extract text from PDF files.
A collection of tools for OCR (optical character recognition).
A small demo that extracts text from images via OCR using the Google Vision API in Python
VNDB explorer and VNR-like text hooker.
View PDF on X11 and the Linux framebuffer; resize PDF; convert PDF to text, HTML, TeX, groff
tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension. Alas Extracts text 💪.