Optimal Scraping Technique: CSS Selector, XPath, & RegEx

Web scraping deals with HTML almost exclusively. In nearly all cases, what is required is a small sample from a very large file (e.g. pricing information from an ecommerce page). Therefore, an essential part of scraping is searching through an HTML document and finding the correct information. How that should be done is a matter of some debate and depends on preference, experience, and the type of data. While all scraping and parsing methods are “correct”, some of them have benefits that may be […]
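
As a rough illustration of the search problem described above, here is a minimal sketch that pulls a price out of a tiny snippet in two ways: with a regular expression and with the limited XPath subset in Python's standard library. The markup, the class names, and the price are invented for the example, and the XPath route assumes the snippet is well-formed XML, which real-world HTML often is not. CSS selectors are not shown because they need a third-party parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a much larger page.
html = '<div><span class="price">$19.99</span><span class="name">Mug</span></div>'

# Approach 1: regular expression anchored on the surrounding tag.
match = re.search(r'<span class="price">([^<]+)</span>', html)
price_regex = match.group(1) if match else None

# Approach 2: XPath-style query (ElementTree supports only a subset of XPath).
root = ET.fromstring(html)
node = root.find(".//span[@class='price']")
price_xpath = node.text if node is not None else None

print(price_regex, price_xpath)
```

The regex version breaks as soon as the attribute order or spacing changes, while the XPath version depends on the document being parseable; which trade-off is acceptable depends on the page.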

Regex Cheatsheet For Natural Language Processing tasks

This article was published as a part of the Data Science Blogathon. Introduction: Regex is shorthand for Regular Expression. It is a representation of a set of strings. Say we have a list of emails and we want to check whether they are in the correct format. One way is to check every email manually, but that is not feasible when the number of emails is large. So, regex here comes to your rescue. […]
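
The email check mentioned above could be sketched as follows. The pattern is a deliberately simplified assumption (one "@", no whitespace, at least one dot in the domain) and is not a full address validator; the function name and the sample addresses are made up for illustration.

```python
import re

# Simplified pattern (an assumption, not a complete email grammar):
# some non-space, non-@ characters, an "@", then a domain containing a dot.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def looks_like_email(address: str) -> bool:
    """Return True if the whole string matches the simplified pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None

emails = ["alice@example.com", "bob@example", "not an email"]
valid = [e for e in emails if looks_like_email(e)]
print(valid)
```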

Part 13: Step by Step Guide to Master NLP – Regular Expressions

This article was published as a part of the Data Science Blogathon. Introduction: This article is part of an ongoing blog series on Natural Language Processing (NLP). With this article, we start our discussion of Regular Expressions. When a data scientist comes across a text processing problem, whether it is searching for titles in names or dates of birth in a dataset, regular expressions rear their ugly head very frequently. They form part of the basic techniques in NLP and […]

How to Get Started with NLP – 6 Unique Methods to Perform Tokenization

Overview: Looking to get started with Natural Language Processing (NLP)? Here’s the perfect first step. Learn how to perform tokenization, a key aspect of preparing your data for building NLP models. We present 6 different ways to perform tokenization on text data. Introduction: Are you fascinated by the amount of text data available on the internet? Are you looking for ways to work with this text data but aren’t sure where to begin? Machines, after all, recognize numbers, […]
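
To make the idea of tokenization concrete, here is a small sketch using only the standard library. The three approaches shown are common baselines chosen for illustration; the excerpt does not say which six methods the article covers, so these should not be read as that list. The sample sentence is invented.

```python
import re

text = "Machines recognize numbers, not words. Let's tokenize!"

# Whitespace tokenization: fast, but punctuation stays attached to words.
whitespace_tokens = text.split()

# Regex word tokenization: keeps runs of word characters, drops punctuation.
word_tokens = re.findall(r"\w+", text)

# Naive sentence tokenization: split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(whitespace_tokens)
print(word_tokens)
print(sentences)
```

Note that the `\w+` pattern splits "Let's" into "Let" and "s", one reason dedicated NLP tokenizers are often preferred for real text.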

Extracting information from reports using Regular Expressions Library in Python

Introduction: It is often necessary to extract key information from reports, articles, papers, etc. Examples include company names and prices from financial reports, names of judges and jurisdictions from court judgments, and account numbers from customer complaints. These extractions are part of Text Mining and are essential in converting unstructured data to a structured form, which is later used for analytics and machine learning. Such entity extraction uses approaches like ‘lookup’, ‘rules’ and ‘statistical/machine learning’. In ‘lookup’ based approaches, […]

FlashText – A library faster than Regular Expressions for NLP tasks

People like me working in the field of Natural Language Processing almost always come across the task of replacing words in a text. The reasons for replacing the words vary. Some of them are: “would’ve” and “would have” represent the same thing, so changing all occurrences of “would’ve” to “would have” is one such task; changing all case variations to a single form, i.e. Python, pytHon, pYthon, pythoN, etc. to python; changing all the synonyms of a word to […]
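
For comparison, the first two replacement tasks above can be sketched with plain regular expressions. This is the regex baseline, not FlashText itself; the function name and sample sentence are hypothetical, and the pattern assumes a straight apostrophe in the input.

```python
import re

def normalize(text: str) -> str:
    """Expand "would've" and lowercase every case variation of "python"."""
    # Expand the contraction regardless of casing (the replacement is always
    # lowercase, so a sentence-initial "Would've" loses its capital).
    text = re.sub(r"\bwould've\b", "would have", text, flags=re.IGNORECASE)
    # Collapse Python, pytHon, PYTHON, ... to a single form.
    text = re.sub(r"\bpython\b", "python", text, flags=re.IGNORECASE)
    return text

sample = "I would've used pytHon, but PYTHON was not installed."
print(normalize(sample))
```

Each additional keyword adds another pass (or another alternative in the pattern), which is the scaling cost that keyword-matching libraries aim to avoid.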

Beginners Tutorial for Regular Expressions in Python

Importance of Regular Expressions: In the last few years, there has been a dramatic shift toward using general-purpose programming languages for data science and machine learning. This was not always the case; a decade ago the idea would have met a lot of skeptical eyes! This means that more people and organizations are using tools like Python and JavaScript to solve their data needs. This is where Regular Expressions become super useful. Regular expressions are normally the default way […]

Introduction to Regular Expressions in Python

In this tutorial we are going to learn about using regular expressions in Python, including their syntax, and how to construct them using built-in Python modules. To do this we’ll cover the different operations in Python’s re module, and how to use it in your Python applications. What are Regular Expressions? Regular expressions are basically just a sequence of characters that can be used to define a search pattern for finding text. This “search engine” is embedded within the Python […]

Using Regex for Text Manipulation in Python

Introduction: Text preprocessing is one of the most important tasks in Natural Language Processing (NLP). For instance, you may want to remove all punctuation marks from text documents before they can be used for text classification. Similarly, you may want to extract numbers from a text string. Writing manual scripts for such preprocessing tasks requires a lot of effort and is prone to errors. Given the importance of these preprocessing tasks, Regular Expressions (aka Regex) have been […]
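
The two preprocessing tasks named above, removing punctuation and extracting numbers, might look like this with Python's re module. The sample text is invented, and "punctuation" here is assumed to mean the ASCII characters in string.punctuation.

```python
import re
import string

text = "Hello, world! Call 555-1234 or 42."

# Build a character class from all ASCII punctuation and delete every match.
punct_class = f"[{re.escape(string.punctuation)}]"
no_punct = re.sub(punct_class, "", text)

# Extract each run of digits as a string.
numbers = re.findall(r"\d+", text)

print(no_punct)
print(numbers)
```

Removing the hyphen also merges "555-1234" into a single token, so the order of the two steps matters if the numbers are needed separately.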

Comparing Strings using Python

In Python, strings are sequences of characters, each stored in memory as an object. Each object can be identified using the built-in id() function, as you can see below. Python tries to re-use objects in memory that have the same value, which also makes comparing objects very fast in Python:
$ python
Python 2.7.9 (default, Jun 29 2016, 13:08:31)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "abc"
>>> b = […]
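
The difference between identity and equality can be sketched as follows. This is an independent illustration rather than a continuation of the truncated session above; the variable names are arbitrary, and whether two equal strings share one object is an implementation detail that should not be relied on.

```python
a = "abc"
b = a                          # a second name bound to the same object
c = "".join(["a", "b", "c"])   # equal value, assembled at runtime

same_object = id(a) == id(b)   # identity: both names refer to one object
equal_value = a == c           # equality: compares the characters

print(same_object, equal_value)
# Identity of equal strings depends on interning, so the result of this
# check may differ between interpreters and versions.
print(a is c)
```

For comparing text, `==` is the right operator; `is` and id() only answer whether two names point at the same object.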
