Extracting information from reports using the Regular Expressions Library in Python

Introduction

It is often necessary to extract key information from reports, articles, papers, and similar documents. Examples include company names and prices from financial reports, judges' names and jurisdictions from court judgments, and account numbers from customer complaints.

These extractions are part of Text Mining and are essential for converting unstructured data into a structured form, which can later be used for analytics and machine learning.

Such entity extraction typically uses one of three approaches: ‘lookup’, ‘rules’, and ‘statistical/machine learning’. In a ‘lookup’-based approach, words from the input documents are searched against a predefined data dictionary. In a ‘rules’-based approach, pattern searches are used to find key information. In a ‘statistical’ approach, supervised or unsupervised methods are used to extract the information.
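To make the difference between the first two approaches concrete, here is a minimal sketch. The sample sentence, the company dictionary, and the amount pattern are all made up for illustration; a real pipeline would need a far larger dictionary and more careful rules.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Acme Corp reported revenue of $12.5 million, while Globex Inc posted $3 million."

# 'Lookup' approach: check the text against a predefined data dictionary.
company_dictionary = {"Acme Corp", "Globex Inc", "Initech"}
found_companies = sorted(name for name in company_dictionary if name in text)

# 'Rules' approach: describe the shape of the information as a pattern.
# Here: a dollar sign, digits, an optional decimal part, an optional unit.
amount_pattern = re.compile(r"\$\d+(?:\.\d+)?(?: million| billion)?")
found_amounts = amount_pattern.findall(text)

print(found_companies)  # ['Acme Corp', 'Globex Inc']
print(found_amounts)    # ['$12.5 million', '$3 million']
```

The lookup only finds names that are already in the dictionary, whereas the rule finds any amount that fits the pattern, including ones never seen before.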

Regular expressions (RegEx) are one of the ‘rules’-based pattern search methods.

Basic syntax

Python supports regular expressions through the standard library module “re” (though it is not fully Perl-compatible). Search patterns are usually written as raw strings, marked with the prefix “r”, rather than as regular strings. In a raw string, Python does not treat backslashes as the start of an escape sequence, so they reach the RegEx engine unchanged.
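The effect of the raw-string prefix can be seen in a small sketch. The complaint text and the account-number format (“ACC” followed by six digits) are assumptions made up for this example.

```python
import re

# Hypothetical complaint text and account-number format, for illustration only.
text = "Complaint received for account ACC123456 on 2021-03-04."

# Regular string: Python converts "\b" into a backspace character before
# the pattern ever reaches the regex engine, so the word boundary is lost.
plain_pattern = "\bACC\\d{6}\b"

# Raw string: backslashes are passed through unchanged, so the regex engine
# sees \b (word boundary) and \d (digit) as intended.
raw_pattern = r"\bACC\d{6}\b"

plain_matches = re.findall(plain_pattern, text)
raw_matches = re.findall(raw_pattern, text)

print(len("\b"), len(r"\b"))  # 1 2
print(plain_matches)          # []
print(raw_matches)            # ['ACC123456']
```

The regular-string version fails silently: it is a valid pattern, but it looks for literal backspace characters that the text does not contain.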
