Parsing a Web Page

The next thing I tried was parsing a web page using BeautifulSoup and NLTK.

There is more information on the Internet than any human can absorb in a lifetime. What you need is not access to that information, but a scalable way to collect, organize, and analyze it. For this, we use web scraping. Web scraping automatically extracts data and presents it in a format you can easily make sense of. In this tutorial, we’ll focus on its applications in the financial market, but web scraping can be used in a wide variety of situations.

The first step was, as always, installing the libraries: I installed BeautifulSoup and urllib to start building my scraper. One key thing to note is that building a scraper requires a basic knowledge of what an HTML page looks like, because scrapers use the HTML tags on a website to locate and return text.
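
To make this concrete, here is a minimal sketch of that setup. The URL is just a placeholder, and any site you point it at will have its own tag structure:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Placeholder URL; substitute the page you actually want to scrape
url = "https://example.com"
html = urlopen(url).read()

# Parse the raw HTML with Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# Scrapers navigate by HTML tags: grab the title and every paragraph's text
print(soup.title.string)
for p in soup.find_all("p"):
    print(p.get_text())
```

Which tags to pull (`p`, `div`, `table`, and so on) depends entirely on the page you inspect, which is why that basic HTML knowledge matters.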

There are also certain rules that need to be followed when building a web scraper.


  1. You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
  2. Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves reasonably (i.e., acts like a human). One request per webpage per second is good practice; see the rate-limiting sketch after this list.
  3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
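
As a rough illustration of rule 2, a scraper can simply pause between requests. The URLs below are hypothetical:

```python
import time
from urllib.request import urlopen

# Hypothetical list of pages to fetch
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for url in urls:
    html = urlopen(url).read()
    # ... parse the page here ...
    time.sleep(1)  # pause one second between requests so we don't hammer the site
```
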
The next step is to inspect the page you want to scrape. This tells you which tags the information you want sits under on the webpage. Once we have the text from the website, we can perform tokenization and POS tagging.
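
Here is a rough sketch of that last step with NLTK. Again the URL is a placeholder, and the `nltk.download()` calls only need to run once:

```python
import nltk
from urllib.request import urlopen
from bs4 import BeautifulSoup

# One-time downloads of the tokenizer and POS-tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

html = urlopen("https://example.com").read()  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

# get_text() strips the tags and keeps only the visible text
text = soup.get_text()

# Split the text into word tokens, then tag each token with a part of speech
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
print(tagged[:10])  # first ten (token, tag) pairs
```
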

The GitHub link:
https://github.com/jayeetaroy/Text-analytics-Python/blob/master/scraper%20test.ipynb
