July 6, 2021 Scrape

mlscraper: Scrape data from HTML pages automatically with Machine Learning

mlscraper

mlscraper extracts structured data from HTML automatically using machine learning. You train it by providing a few examples of the output you want. It then works out the extraction rules for you, after which you can extract the same kind of data from the new pages you provide.

How it works

After you’ve defined the data you want to scrape, mlscraper will:

find your samples inside the HTML DOM
determine which rules/methods to apply for extraction
extract the data for you and return it in a dictionary
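
To make the three steps above concrete, here is a toy sketch that uses only the Python standard library. It is not mlscraper's implementation and uses none of its API: the helper names (`index_page`, `train`, `extract`), the HTML snippets, and the "rule" format (a tag name plus a class attribute) are all assumptions made for illustration. A real scraper would need far more robust matching and rule selection.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _TextIndexer(HTMLParser):
    """Record (tag, class, text) for every element that directly contains text."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open elements as (tag, class) pairs
        self.entries = []  # collected (tag, class, text) triples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, cls = self._stack[-1]
            self.entries.append((tag, cls, text))


def index_page(html):
    """Parse a page and return its (tag, class, text) triples."""
    parser = _TextIndexer()
    parser.feed(html)
    parser.close()
    return parser.entries


def train(html, sample):
    """Steps 1 and 2: find each sample value in the page and keep its location as a rule."""
    entries = index_page(html)
    rules = {}
    for field, value in sample.items():
        for tag, cls, text in entries:
            if text == value:
                rules[field] = (tag, cls)
                break
        else:
            raise ValueError(f"sample value for {field!r} not found in page")
    return rules


def extract(html, rules):
    """Step 3: apply the rules to a new page and return the data as a dictionary."""
    entries = index_page(html)
    return {
        field: next((text for t, c, text in entries if (t, c) == rule), None)
        for field, rule in rules.items()
    }


# Hypothetical pages, invented for this sketch.
training_page = (
    '<html><body><h1 class="title">One great result!</h1>'
    '<p class="description">Some description</p></body></html>'
)
new_page = (
    '<html><body><h1 class="title">A different headline</h1>'
    '<p class="description">A different summary</p></body></html>'
)

rules = train(training_page, {"title": "One great result!", "description": "Some description"})
result = extract(new_page, rules)
print(result)
```

The sketch only shows the overall shape of the workflow: the training sample tells the scraper where to look, and the learned location is then reused on pages it has not seen.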

    import requests
    from mlscraper import RuleBasedSingleItemScraper
    from mlscraper.training import SingleItemPageSample

    # the items found on the training page
    targets = {
        "https://test.com/article/1": {"title": "One great result!", "description": "Some description"},
        "https://test.com/article/2": {"title": "Another great result!",
 
 

 
To finish reading, please visit source site


		
		
	
