Mind WaveAI Solutions Pvt Ltd
Experience: 5-8 years
Location: Hyderabad / Secunderabad
Vacancies: 2
Python Developer: PDF Table Extraction (Open-Source OCR & AI)
Posted 22 days ago
Fixed timing
We are seeking a skilled **Python Developer** with expertise in extracting **unstructured tables from PDF documents** using **open-source models**. The ideal candidate should have hands-on experience with **OCR, deep learning, and NLP techniques** to accurately process and structure tabular data.
### **Key Responsibilities:**
- Develop and implement **Python-based solutions** to extract tables from **unstructured PDFs**.
- Utilize **open-source libraries** like **pdfplumber, Tesseract OCR, Camelot, Tabula, PyMuPDF**, and deep learning-based models.
- Handle **complex table structures, multi-page tables, and merged cells** effectively.
- Preprocess PDFs, including **noise reduction, skew correction, and text enhancement**.
- Use AI/ML models (e.g., **Detectron2, LayoutLM, Donut, or Graph Neural Networks**) for intelligent table extraction.
- Optimize the accuracy and reliability of extracted data through **post-processing techniques**.
- Ensure **scalability, performance, and error handling** for large document processing.
- Store extracted data in **structured formats** such as **Pandas DataFrames, SQL tables, or JSON**.
- Collaborate with teams to **integrate the solution into an existing pipeline or API**.
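
To make the multi-page, merged-cell, and post-processing responsibilities above concrete, here is a minimal pure-Python sketch. It assumes each page yields a table fragment as a list of rows whose cells are strings or `None` (similar in shape to what libraries such as pdfplumber return), that the first row of the first fragment is the header, and that continuation pages may repeat it. The function names are hypothetical and are not taken from any library named in this posting.

```python
from typing import List, Optional

Row = List[Optional[str]]


def clean_cell(cell: Optional[str]) -> str:
    """Collapse internal whitespace and map missing cells to an empty string."""
    if cell is None:
        return ""
    return " ".join(cell.split())


def merge_page_tables(fragments: List[List[Row]]) -> List[List[str]]:
    """Stitch per-page fragments of one table into a single list of rows.

    Assumption: the first non-empty row seen is the header, and any later
    row identical to it is a repeated header on a continuation page.
    """
    merged: List[List[str]] = []
    header: Optional[List[str]] = None
    for fragment in fragments:
        for raw_row in fragment:
            row = [clean_cell(cell) for cell in raw_row]
            if not any(row):
                continue  # skip rows that are entirely empty
            if header is None:
                header = row
                merged.append(row)
            elif row != header:
                merged.append(row)
    return merged


def forward_fill(rows: List[List[str]], column: int = 0) -> List[List[str]]:
    """Fill blanks in one column with the value above them.

    Blank cells under a value are a common trace of vertically merged cells.
    The header row (index 0) is left untouched.
    """
    filled: List[List[str]] = []
    last = ""
    for index, row in enumerate(rows):
        new_row = list(row)
        if index > 0:
            if new_row[column]:
                last = new_row[column]
            else:
                new_row[column] = last
        filled.append(new_row)
    return filled


# Example fragments, as they might come from two consecutive pages.
page1 = [["Region", "Item", "Qty"], ["North", "Bolts", " 10 "], [None, "Nuts", "5"]]
page2 = [["Region", "Item", "Qty"], ["South", "Bolts\n(large)", "7"], [None, None, None]]
table = forward_fill(merge_page_tables([page1, page2]))
```

Real documents will need looser header matching (for example, tolerance for OCR noise) and per-column rules for which blanks are safe to fill; this sketch only shows the overall shape of the step.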
### **Required Skills:**
✅ **Strong Python skills** (NumPy, Pandas, OpenCV, TensorFlow/PyTorch).
✅ **Experience with OCR tools** (Tesseract, EasyOCR, PaddleOCR).
✅ **PDF processing libraries** (pdfplumber, PyMuPDF, Camelot, Tabula).
✅ **Deep Learning models** for document understanding (Detectron2, LayoutLM, Donut).
✅ **Preprocessing techniques** (denoising, deskewing, contour detection).
✅ **Experience with NLP and Computer Vision for text segmentation**.
✅ Knowledge of **data extraction, transformation, and validation techniques**.
✅ Familiarity with **Docker, API integration, and cloud storage solutions**.
### **Preferred Skills (Bonus):**
🔹 Experience in **Graph Neural Networks (GNN) for table structure detection**.
🔹 Working knowledge of **Hugging Face transformers for document AI**.
🔹 Familiarity with **LLMs for intelligent document parsing (LlamaIndex, LangChain)**.
### **Project Goal:**
Develop an **end-to-end open-source solution** that accurately extracts and structures tables from **scanned and text-based PDFs** without using paid services like **AWS Textract, Google Vision, or Azure Form Recognizer**.
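
Because the goal covers both scanned and text-based PDFs, one early design decision is how to route each page. The sketch below is a rough illustration only: `PageInfo`, `plan_extraction`, and both thresholds are hypothetical, and the per-page text and image coverage are assumed to have been gathered beforehand with one of the open-source PDF libraries listed above.

```python
from typing import Dict, List, NamedTuple


class PageInfo(NamedTuple):
    number: int              # 1-based page number
    embedded_text: str       # text from the PDF text layer, "" if none
    image_area_ratio: float  # fraction of the page covered by images, 0.0 to 1.0


def plan_extraction(
    pages: List[PageInfo],
    min_chars: int = 20,           # assumed threshold, tune on real documents
    min_image_ratio: float = 0.5,  # assumed threshold, tune on real documents
) -> Dict[str, List[int]]:
    """Route each page to text-layer parsing, OCR, or skipping."""
    plan: Dict[str, List[int]] = {"text_layer": [], "ocr": [], "skip": []}
    for page in pages:
        # Count visible characters only, ignoring whitespace.
        visible_chars = len("".join(page.embedded_text.split()))
        if visible_chars >= min_chars:
            plan["text_layer"].append(page.number)
        elif page.image_area_ratio >= min_image_ratio:
            # Little or no text but mostly image: likely a scanned page.
            plan["ocr"].append(page.number)
        elif visible_chars > 0:
            plan["text_layer"].append(page.number)
        else:
            plan["skip"].append(page.number)  # effectively blank page
    return plan


pages = [
    PageInfo(1, "Invoice total 1,250.00 due on receipt", 0.0),
    PageInfo(2, "", 0.95),
    PageInfo(3, "  \n", 0.0),
    PageInfo(4, "p. 4", 0.0),
]
plan = plan_extraction(pages)
```

Pages routed to `ocr` would go through preprocessing and an open-source OCR engine, while `text_layer` pages can be parsed directly, which is usually faster and more accurate when a text layer exists.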
Employment Type: Part Time