Data Extraction Basics for Docs and Images with OCR and NER

Become a Data Extraction Expert with Python, Pandas, OCR, NER, and spaCy: Learn to Train and Build Real-World Solutions

Ratings 4.85 / 5.00
What You Will Learn!

  • Learn how to extract data from PDFs, Word docs, scanned images, and more with ease.
  • Use Tesseract and PyTesseract to perform optical character recognition (OCR) on images with accuracy.
  • Develop a common pipeline for data extraction from different types of input documents.
  • Learn how to develop a robust data extraction workflow.
  • Get started with using spaCy efficiently for labelling.
  • Learn how to train spaCy on your own data set.
  • Use Pandas to convert extracted data to CSV format.
  • Design a customizable OCR solution for data extraction.

Description

Master Smart Data Extraction from PDFs and Images with Python, Pandas, OCR, Tesseract, PyTesseract, OpenCV, spaCy, and NER

Gain a competitive edge in the world of computer vision by learning how to extract data from PDFs and images intelligently. In this comprehensive course, you'll learn how to use a variety of powerful tools and techniques, including:

  • Python: A versatile and widely used programming language for data science and machine learning

  • Pandas: A powerful library for data manipulation and analysis

  • OCR: Optical character recognition, used to convert images of text into machine-readable text

  • Tesseract: A popular open-source OCR engine

  • PyTesseract: A Python wrapper for Tesseract

  • OpenCV: A computer vision library

  • spaCy: A natural language processing (NLP) library

  • NER: Named entity recognition, used to identify and classify named entities in text

You'll also learn how to build a common pipeline for data extraction from different types of input documents, including structured PDF documents, scanned PDF documents, and Word documents. By the end of the course, you'll be able to develop robust data extraction solutions for a variety of real-world applications.
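
To make the idea of a common pipeline concrete, here is a minimal sketch, not the course's actual code. It assumes the pipeline picks an extractor by file extension and returns every result in one shared record format. The three extractor functions and the `EXTRACTORS` table are hypothetical placeholders; in a real pipeline they would wrap a PDF text reader, an OCR call, and a Word document reader.

```python
from pathlib import Path
from typing import Callable, Dict


# Hypothetical extractors: placeholders standing in for real PDF, OCR,
# and Word handling. Each one takes a path and returns plain text.
def extract_structured_pdf(path: Path) -> str:
    return f"[text layer of {path.name}]"


def extract_scanned_image(path: Path) -> str:
    return f"[OCR output for {path.name}]"


def extract_word_doc(path: Path) -> str:
    return f"[paragraphs of {path.name}]"


# Assumed mapping from file extension to extractor. A scanned PDF has no
# text layer, so a real pipeline would need an extra check before choosing
# between the PDF and OCR routes; that check is left out of this sketch.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_structured_pdf,
    ".png": extract_scanned_image,
    ".jpg": extract_scanned_image,
    ".docx": extract_word_doc,
}


def extract_text(filename: str) -> dict:
    """Route a file to its extractor and return a common record format."""
    path = Path(filename)
    suffix = path.suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported input type: {suffix!r}")
    return {
        "source": path.name,
        "kind": suffix.lstrip("."),
        "text": EXTRACTORS[suffix](path),
    }


if __name__ == "__main__":
    for name in ["invoice.pdf", "receipt.png", "contract.docx"]:
        print(extract_text(name))
```

Because every extractor returns the same record shape, later steps such as labelling and CSV export do not need to know which kind of document the text came from.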


Unique Offerings:

  • Code walkthrough of a working pipeline that performs operations on documents such as conversion, extraction, and labeling

  • Line-by-line code walkthrough of the operations performed at each step

  • The end product you build with us by the end of the course is in working condition, and support is provided within 24 hours for any issues you face

  • Detailed explanation of the steps required to train spaCy for NER


Key Topics:

  • Understanding Data Conversion

  • Conversion and Extraction from structured PDF document

  • Conversion of Scanned PDF document

  • Conversion and Extraction of Data from Word Document

  • Common Format for Pipeline

  • Image Reading using PIL and OpenCV

  • Tesseract for Extraction

  • Tesseract Page Segmentation Mode (PSM) and OCR Engine Mode (OEM)

  • Extraction of Data from Image

  • PyTesseract Operations

  • Named Entity Recognition (NER)

  • spaCy Entity Types

  • IOB Format

  • Labelling with spaCy for NER

  • Training a spaCy NER Model on Custom Data

  • Predicting with the Trained spaCy Model

  • Pandas

  • Convert Data to CSV Output
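
As a small illustration of the IOB format listed among the topics above, the sketch below tags each token as B- (beginning of an entity), I- (inside an entity), or O (outside any entity). The `to_iob` helper, the sample sentence, and the span layout are assumptions made for this example and are not taken from the course material.

```python
from typing import List, Tuple


def to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans to IOB tags.

    Each span is (start_token_index, end_token_index_exclusive, label).
    """
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


# Sample data invented for illustration only.
tokens = ["Invoice", "from", "Acme", "Corp", "dated", "2023-01-15"]
spans = [(2, 4, "ORG"), (5, 6, "DATE")]

for token, tag in zip(tokens, to_iob(tokens, spans)):
    print(f"{token}\t{tag}")
```

Here "Acme" is tagged B-ORG and "Corp" I-ORG, which is how a multi-token entity stays distinguishable from two separate entities that happen to sit side by side.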


Who Should Attend!

  • Python developers who need to extract data from various sources for their work.
  • Students who are interested in learning about data extraction and how it can be used to solve real-world problems.
  • Anyone who is curious about data extraction and wants to learn more about it.

Tags

  • Natural Language Processing
  • Computer Vision
  • OCR (Optical Character Recognition)

Subscribers

310

Lectures

39
