Thursday, 28 May 2015

Data Scraping Services - Scraping Yelp Business Data With Python Scraping Script

Yelp is a great source of business contact information with details like address, postal code, contact information; website addresses etc. that other site like Google Maps just does not. Yelp also provides reviews about the particular business. The yelp business database can be useful for telemarketing, email marketing and lead generation.

Are you looking for yelp business details database? Are you looking for scraping data from yelp website/business directory? Are you looking for yelp screen scraping software? Are you looking for scraping the business contact information from the online Yelp? Then you are at the right place.

Here I am going to discuss how to scrape yelp data for lead generation and email marketing. I have made a simple and straight forward yelp data scraping script in python that can scrape data from yelp website. You can use this yelp scraper script absolutely free.

I have used urllib, BeautifulSoup packages. Urllib package to make http request and parsed the HTML using BeautifulSoup, used Threads to make the scraping faster.

Yelp Scraping Python Script

import urllib from bs4 import BeautifulSoup import re from threading import Thread #List of yelp urls to scrape url=['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg','http://www.yelp.com/biz/casa-franco-caramba-hamburg','http://www.yelp.com/biz/o-ren-ishii-hamburg','http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2','http://www.yelp.com/biz/superbude-hamburg-2','http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg','http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg','http://www.yelp.com/biz/yoho-hamburg'] i=0 #function that will do actual scraping job def scrape(ur): html = urllib.urlopen(ur).read() soup = BeautifulSoup(html) title = soup.find('h1',itemprop="name") saddress = soup.find('span',itemprop="streetAddress") postalcode = soup.find('span',itemprop="postalCode") print title.text print saddress.text print postalcode.text print "-------------------" threadlist = [] #making threads while i<len(url): t = Thread(target=scrape,args=(url[i],)) t.start() threadlist.append(t) i=i+1 for b in

threadlist: b.join()

import urllib

from bs4 import BeautifulSoup

import re

from threading import Thread

 #List of yelp urls to scrape

url=['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg','http://www.yelp.com/biz/casa-franco-caramba-hamburg','http://www.yelp.com/biz/o-ren-ishii-hamburg','http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2','http://www.yelp.com/biz/superbude-hamburg-2','http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg','http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg','http://www.yelp.com/biz/yoho-hamburg']

 i=0

#function that will do actual scraping job

def scrape(ur):

           html = urllib.urlopen(ur).read()

          soup = BeautifulSoup(html)

       title = soup.find('h1',itemprop="name")

          saddress = soup.find('span',itemprop="streetAddress")

          postalcode = soup.find('span',itemprop="postalCode")

          print title.text

          print saddress.text

          print postalcode.text

          print "-------------------"

 threadlist = []

#making threads

while i<len(url):

          t = Thread(target=scrape,args=(url[i],))

          t.start()

          threadlist.append(t)

          i=i+1

for b in threadlist:

          b.join()

Recently I had worked for one German company and did yelp scraping project for them and delivered data as per their requirement. If you looking for scraping data from business directories like yelp then send me your requirement and I will get back to you with sample.

Source: http://webdata-scraping.com/scraping-yelp-business-data-python-scraping-script/

Tuesday, 26 May 2015

Web Scraping Services : What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, uses screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much quicker than a human could, which can cause difficulty for the servers to handle it. This can cause degraded performance in the website. Malicious hackers use this tactic in what’s known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Source: https://quickleft.com/blog/is-web-scraping-ethical/

Monday, 25 May 2015

Improving performance for web scraping code

2 down vote favorite

I have a website in which the code scrapes other websites for getting the accurate data. While the code works good but there a decent lag in performance because the code firsts downloads the html stream from various sites(some times 9 websites), extracts the relative part and then renders the html page.

What should I do to get an optimal performance. Should I change from shared hosting (godaddy) to my own server or it has nothing to do with my hosting and I need to make changes to my code?

1 Answer

API/CSV

Ask those websites if they provide an API, or, if you don't need an up-to-date information or the information you need doesn't change frequently, if they can sell/give you for free the data itself (for example in an CSV file). Some small websites may have fancier ways to access data, like a CSV file for the older information, and an RSS feed for the changed one.

Those websites would probably be happy to help you, since providing you with an API would reduce their own CPU and bandwidth usage by you.

Profile

Screen scrapping is really ugly when it comes to performance and scaling. You may be limited by:

    your machine performance, since parsing, sometimes an invalid HTML file, takes time,

    your network speed,

    their network speed usage, i.e. how fast can you access the pages of their website depending on the restrictions they set, like the DOS protection and the number of requests per second for screen scrappers and search engine crawlers,

    their machine performance: if they spend 500 ms. to generate every page, you can't do anything to reduce this delay.

If, despite your requests to them, those websites cannot provide any convenient way to access their data, but they give you a written consent to screen scrape their website, then profile your code to determine the bottleneck. It may be the internet speed. It may be your database queries. It may be anything.

For example, you may discover that you spend too much time finding with regular expressions the relevant information in the received HTML. In that case, you would want to stop doing it wrong and use a parser instead of regular expressions, then see how this improve the performance.

You may also find that the bottleneck is the time the remote server spends generating every page. In this case, there is nothing to do: you may have the fastest server, the fastest connection and the most optimized code, the performance will be the same.

Do things in parallel:

Remember to use parallel computing wisely and to always profile what you're doing, instead of doing premature optimization, in hope that you're smarter than the profiler.

Especially when it comes to using network, you may be very surprised. For example, you may believe that making more requests in parallel will be faster, but as Steve Gibson explains in episode 345 of Security Now, this is not always the case.

Legal aspects

Also note that screen scrapping is explicitly forbidden by the conditions of use (like on IMDB) on many websites. And if nothing is said on this subject in conditions of use, it doesn't mean that you can screen scrape those websites.

The fact that the information is available publicly on the internet doesn't give you the right to copy and reuse it this way neither.

Why? you may ask. For two reasons:

    Most websites are relying on advertisement and marketing. When people use one of those websites directly, they waste some CPU/network bandwidth of the website, but in response, they may click on an ad or buy something sold on the website. When you screen scrape, your bot waste their CPU/network bandwidth, but will never click on an ad or buy something.

    Displaying the information you screen scrapped on your website can have even worse effects. Example: in France, there are two major websites selling hardware. The first one is easy and fast to use, has a nice visual design, better SEO, and in general is very well done. The second one is a crap, but the prices are lower. If you screen scrape them and give the raw results (prices with links) to your users, they will obviously click on the lower price every time, which means that the website with pretty design will have less chances to sell the products.

    People made an effort in collecting, processing and displaying some data. Sometimes they paid to get it. Why would they enjoy seeing you pulling this data conveniently and for free?

Source: http://programmers.stackexchange.com/questions/141403/improving-performance-for-web-scraping-code/141406#141406

Friday, 22 May 2015

5 tips for scraping big websites

Scraping bigger websites can be a challenge if done the wrong way.

Bigger websites would have more data, more security and more pages. We’ve learned a lot from our years of crawling such large complex websites, and these tips could help solve some of your challenges

1. Cache pages visited for scraping

When scraping big websites, its always a good idea to cache the data you have already downloaded. So you don’t have to put load on the website again, in case you have start over again or that page is required again during scraping. Its effortless to cache to a key value store like redis.But Databases and filesystem caches are also good.

2. Don’t flood the website with large number of parallel requests , take it slow

Big websites posses algorithms to detect webscraping, large number of parallel requests from the same ip address will identify you as a Denial Of Service Attack on their website, and blacklist your IPs immediately. A better idea is to time your requests properly one after the other, giving it some human behavior. Oh!.. but scraping like that will take you ages. So balance requests using the average response time of the websites, and play around with the number of parallel requests to the website to get the right number.

3. Store the URLs that you have already fetched

You may want to keep a list of URLs you have already fetched, in a database or a key value store. What would you do if you scraper crashes after scraping 70% of the website. If you need to complete the remaining 30%, with out this list of URLs, you’ll waste a lot of time and bandwidth. Make sure you store this list of URLs

somewhere permanent, till you have all the required data. This could also be combined with the cache. This way, you’ll be able to resume scraping.

4. Split scraping into different phases

Its easier and safer if you split the scraping into multiple smaller phases. For example, you could split scraping a huge site into two. One for gathering links to the pages from which you need to extract data and another for downloading these pages to scrape content.

5. Take only whats required

Don’t grab or follow every link unless its required. You can define a proper navigation scheme to make the scraper visit only the pages required. Its always tempting to grab everything, but its just a waste of bandwidth, time and storage.

Source: http://learn.scrapehero.com/5-tips-for-scraping-big-websites/

Wednesday, 20 May 2015

How Web Data Extraction Services Impact Startups

Starting a business has its fair share of ebbs and flows – it can be extremely challenging to get a new business off the blocks, and extremely rewarding when everything goes according to plan and yields desired results. For startups, it is important to get the nuances of running a business right from day one. To succeed in an immensely competitive space, startups need to perform above and beyond expectation right from the start, and one of the factors that can be of great help during the growing years of a startup is web data extraction.

Web data extraction through crawling and scraping, a highly efficient information gathering process, can be used in many creative ways to bring about major change in the performance graph of a startup. With effective web data extraction services acquired by outsourcing to a reputed company, the business intelligence gathered and the numerous possibilities associated with it, web crawling and extraction services can indeed become the difference maker for a startup, propelling it to the heights of success.

What drives the success of web data extraction?

When it comes to figuring out the perfect, balanced web data collection methodology for startups, there are a lot of crucial factors that come into play. Some of these are associated with the technical aspects of data collection, the approach used, the time invested, and the tools involved. Others have more to do with the processing and analysis of collected information and its judicious use in formulating strategies to take things forward.

Web Crawling Services & Web Scraping Services

With the advent of highly professional web data extraction services providers, massive amounts of structured, relevant data can be gathered and stored in real time, and in time, productively used to further the business interests of a startup. As a new business owner, it is important to have a high-level knowledge of the modern and highly functional web scraping tools available for use. This will help to utilize the prowess of competent data extraction services. This in turn can assist both in the immediate and long-term revenue generation context.

Web Data Extraction for Startups

From the very beginning, the dynamics of startups is different from that of older, well-established businesses. The time taken by the new business entity in proving its capabilities and market position needs to be used completely and effectively. Every day of growth and learning needs to add up to make a substantial difference. In this period, every plan and strategy, every execution effort, and every move needs to be properly thought out.

In such a trying situation where there is little margin for error, it pays to have accurate, reliable, relevant and actionable business intelligence. This can put you in firm control of things by allowing you to make informed business decisions and formulate targeted, relevant and growth oriented business strategies. With powerful web crawling, the volume of data gathered is varied, accurate and relevant. This data can then be studied minutely, analyzed in detail and arranged into meaningful clusters. With this weapon in your arsenal, you can take your startup a long way with smart decisions and clever implementations.

Web data extraction is a task best handled by professionals who have had rich experience in the field. Often, in-house web scraping teams are difficult to assemble and not economically viable to maintain, especially for startups. For a better solution, you can outsource your web scraping needs to a reliable web data extraction service for data collection. This way, you can get all the relevant intelligence you need without overstraining your workforce or having to employ additional personnel to handle web scraping. The company you outsource your work to can easily scrape data from multiple sources as per your requirements, and furnish you with actionable business intelligence that can help you take a lead in a competitive market.

Different Ways for Startups to use Web Data Extraction

Web scraping can be employed for many different purposes to yield different kinds of relevant data that generate actionable insights. For a startup, the important decision is how to use this powerful technique to provide valuable information that can make a difference for the future prospects of the company. Here are some interesting possibilities when it comes to impactful web data extraction for startups –

Fishing for Social Rankings and Backlinks

One of the most important business processes for a startup is competition analysis. This is one area where web data extraction can come across as an invaluable enabler. In the past, many startups have effectively used web scraping to fish for backlinks and social rankings related to competing companies.

Backlinks are important to reach a greater mass of better-targeted audiences, which can go on to increase customer base with minimal efforts. Social ranking is also an immensely important factor, as social actions on the internet are building blocks of opinion and reputation generation in this day and age. Keeping this in mind, you can use web data extraction to scrape for social rankings and backlinks related to content generated by your competing companies. After careful analysis, it is possible to arrive at concrete conclusions regarding what your competitors are doing well, and what sells the best.

This information is gold for marketers and sales personnel, and can be used to discern exactly what needs to be done to increase social buzz, generate favorable opinion, and win over customers from your competitors. You can also use this technique to develop high authority backlinks that help with SEO, targeted reach and organic traffic for your business website. For competition analysis, web scraping is a formidable tool.

Sourcing Contact Information

Another important aspect of business that startups can never ignore is good networking. Whether it is with customers, prospective customers, industry peers, partners, or competitors, excellent networking and open, transparent communication is essential for the success of your startup. For effective communication and networking, you need a large, solid list of contact information pertaining to your exact requirements.

Scraping data from multiple web sources gives you the perfect method of achieving this. With automated, fast web scraping, you can in a short time collect a wealth of important contact information that can be leveraged in many different ways. Whether it is the formation of lasting business relationships or making potential customers aware of what you have on offer, this information has the power to propel your startup to new levels of recognition.

For Ecommerce

If you sell your products and services online and want to stay on top of the competition when it comes to variety, pricing analysis, and special deals and offers, web scraping is the way to go. For many e-commerce startups, the problem of high CTR and low conversion is a stumbling block to higher bottom lines. To remedy problems like these and to ensure better sales, it is always a good idea to have a clear insight about your competition.

Future of Retail Industry

With web data extraction, you can be always aware of what competing companies are doing in terms of pricing strategies, product diversity and special customer offers. By considering that information while evaluating and cementing your own strategies, you can always ensure that you provide better value and range of products and services than your competitors, and therefore stay ahead of the competition.

For Marketing, Brand Promotion and Advertisement

For startups, the first wave of promotion and marketing is the one that holds the key to your long-term business success. It is during this phase that the first and most important public perception of your company is formed, and the rudiments of public opinion start taking shape. For this reason, it is crucial to be on point with your marketing and promotion during the early, formative years of your business.

To achieve this, you need a clear, in-depth understanding of your target audience. You need to categorize your target audience on the basis of many factors like age, gender, demographics, income groups and tastes and preferences. Such detailed understanding can only be possible when you have a large wealth of social data pertaining to your target audience. There is no better way of achieving this than by web data extraction.

Love your brand

With the help of data extraction services, you can gather large chunks of relevant data regarding your target audience which can help you accurately evaluate the potential of each prospective customers as a possible addition to your business family. To ensure that you have a steady, early wave of customers to take your business off the blocks at a rapid pace, you need to devise marketing campaigns, promotional strategies and advertisements in accordance with the customer knowledge you drive through your web scraping efforts. This is a foolproof strategy to have marketing and promotional plans in place that achieve goals, bring in new business and provide your company with enough initial momentum to carry it through the later years of success.

To conclude, web data extraction can be a veritable tool in the hands of a startup. With the proper use and leveraging of this technique, your startup can gather the required business intelligence to shine in a competitive market and become a favorite with the customer base. Working with the right web data extraction company can be one of the most important business decisions you make as a startup owner.

Source: https://www.promptcloud.com/blog/web-data-extraction-services-for-startups/

Sunday, 17 May 2015

Metadata Scraping Service

As mentioned in Robert's last blog post we set up a scraping service which supports users working with citations by extracting automatically references from digital library or publisher websites. We use a very similar service in BibSonomy to support our users while posting a new reference. However, the service is independent from BibSonomy. Our main goal is to make the metadata of other websites easily accessible to every user who needs bibliographic metadata. Therefore we offer the extracted information in BibTeX format. Most tools allow to import BibTeX so it should be very easy for everyone to get the data into his own tool. The service is running under the following URL:

http://scraper.bibsonomy.org/

Currently we support more than 60 different websites (here the full list) and we are working on further extensions. In the near future we will make the source code of our scrapers publicly available under GPL and we hope that other people will find it useful and start to help us by implementing their own scrapers.

How does the service work?

In principle there are two ways to use the service. One uses a so

called bookmarklet and the other is simply based on the URL. If you

have a webpage of a supported site e.g. from ACM digital library the

following page:

Logsonomy - social information retrieval with logdata

then you can copy this URL into the form on the service homepage and the service will return you the extracted BibTeX information. As this is not a very convenient way to access the data we provide a ScrapePublication button. This button is a small piece of JavaScript and can be copied to the toolbar of the browser. By pressing this button while visiting a digital library webpage the URL will be automatically copied and sent to the scraping service and the metadata is extracted.

The service has three options which can be used to customize it and to make it useful for other systems. Obviously one parameter is the URL itself which is used by the bookmarklet, too. The next is the selection parameter which allows to send text to the service and the last parameter allows to change the output format from html to plain BibTeX. This last parameter makes integration with other systems very simple.

If needed we can provide the metadata in other formats as well but currently we support only BibTeX.

Source: http://blog.bibsonomy.org/2008/11/metadata-scraping-service.html

Wednesday, 6 May 2015

Web Scraping: Startups, Services & Market

I got recently interested in startups using web scraping in a way or another and since I find the topic very interesting I wanted to share with you some thoughts. [Note that I’m not an expert. To correct me / share your knowledge please use the comment section]

Web scraping is everything but a new technique. However with more and more data shared on internet (from user generated content like social networks & review websites to public/government data and the growing number of online services) the amount of data collected and the use cases possible are increasing at an incredible pace.

We’ve entered the age of “Big Data” and web scraping is one of the sources to feed big data engines with fresh new data, let it be for predictive analytics, competition monitoring or simply to steal data.

From what I could see the startups and services which are using “web scraping” at their core can be divided into three categories:

•    the shovel sellers (a.k.a we sell you the technology to do web scraping)

•    the shovel users (a.k.a we use web scraping to extract gold and sell it to our users)

•    the shovel police (a.k.a the security services which are here to protect website owners from these bots)

The shovel sellers

From a technology point of view efficient web scraping is quite complicated. It exists a number of open source projects (like Beautiful Soup) which enable anyone to get up and running a web scraper by himself. However it’s a whole different story when it has to be the core of your business and that you need not only to maintain your scrapers but also to scale them and to extract smartly the data you need.

This is the reason why more and more services are selling “web scraping” as a service. Their job is to take care about the technical aspects so you can get the data you need without any technical knowledge. Here some examples of such services:

    Grepsr
    Krakio
    import.io
    promptcloud
    80legs
    Proxymesh (funny service: it provides a proxy rotator for web scraping. A shovel seller for shovel seller in a way)
    scrapingHub
    mozanda

The shovel users

It’s the layer above. Web scraping is the technical layer. What is interesting is to make sense of the data you collect. The number of business applications for web scraping is only increasing and some startups are really using it in a truly innovative way to provide a lot of value to their customers.

Basically these startups take care of collecting data then extract the value out of it to sell it to their customers. Here some examples:

Sales intelligence. The scrapers screen marketplaces, competitors, data from public markets, online directories (and more) to find leads. Datanyze, for example, track websites which add or drop javascript tags from your competitors so you can contact them as qualified leads.

Marketing. Web scraping can be used to monitor how your competitors are performing. From reviews they get on marketplaces to press coverage and financial published data you can learn a lot. Concerning marketing there is even a growth hacking class on udemy that teaches you how to leverage scraping for marketing purposes.

Price Intelligence. A very common use case is price monitoring. Whether it’s in the travel, e-commerce or real-estate industry monitoring your competitors’ prices and adjusting yours accordingly is often key. These services not only monitor prices but with their predictive algorithms they can give you advice on where the puck will be. Ex: WisePricer, Pricing Assistant.

Economic intelligence, Finance intelligence etc. with more and more economical, financial and political data available online a new breed of services, which collect and make sense of it, are rising. Ex: connotate.

The shovel police

Web scraping lies in a gray area. Depending on the country or the terms of service of each website, automatically collecting data via robots can be illegal. Whatever the laws say it becomes crucial for some services to try to block these crawlers to protect themselves. The IT security industry has understood it and some startups are starting to tackle this problem. Here are 3 services which claim to provide solutions to stop bots from crawling your website:

•    Distil
•    ScrapeSentry
•    Fireblade

From a market point of view

A couple of points on the market to conclude:

•    It’s hard to assess how big the “web scraping economy” is since it is at the intersection of several big industries (billion dollars): IT security, sales, marketing & finance intelligence. This technique is of course a small component of these industries but is likely to grow in the years to come.

•    A whole underground economy also exists since a lot of web scraping is done through “botnets” (networks of infected computers)

•    It’s a safe bet to say that more and more SaaS (like Datanyze pr Pricing Assistant) will find innovative applications for web scraping. And more and more startups will tackle web scraping from the security point of view.

•    Since these startups are often entering big markets through a niche product / approach (web scraping is a not the solution to everything, there are more a feature) they are likely to be acquired by bigger players (in the security, marketing or sales tools industries). The technological barrier are there.

Source: http://clementvouillon.com/article/web-scraping-startups-services-market/