Web Scraping Amazon Using Python




Let's say you find data on the web and there is no direct way to download it. Web scraping using Python is a skill you can use to extract that data into a useful form that can then be imported and used in various ways. Web scraping is one of the tools at a developer's disposal when looking to gather data from the internet. While consuming data via an API has become commonplace, most of the websites online don't have an API for delivering data to consumers. The basic idea of web scraping is that we take existing HTML data, use a web scraper to identify the data we want, and convert it into a useful format. The end stage is to have this data stored as JSON or in another useful format.




Web scraping lets us extract content from web pages for use in a variety of domains, such as data mining and information retrieval. To extract information from the websites of newspapers and magazines, we are going to use the newspaper library.

The main purpose of this library is to extract and curate articles from newspapers and similar websites.

Installation:

  • To install the newspaper library, run the following in your terminal.

  • For the lxml dependencies, run the command below in your terminal.

  • To install PIL (Pillow), run the install command.

  • Download the NLP corpora.
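The actual commands were lost from this copy of the page; based on the newspaper3k documentation (not this article), they are roughly the following:

```shell
# Install the newspaper library (Python 3 build).
pip3 install newspaper3k

# lxml build dependencies (Debian/Ubuntu).
sudo apt-get install libxml2-dev libxslt-dev

# Image handling (Pillow is the maintained fork of PIL).
pip3 install Pillow

# Download the NLP corpora used by article.nlp().
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
```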

Python Web Scraping Tools

The Python newspaper library is used to collect information associated with articles. This includes the author's name, the major images in the article, the publication date, any videos present in the article, keywords describing the article, and a summary of the article.
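The code that originally followed is missing from this copy. A minimal sketch of how the library is typically used looks like this (the URL is a placeholder, and the script needs the newspaper3k package and network access):

```python
from newspaper import Article

# Placeholder URL -- substitute any real news article you want to analyze.
url = "https://example.com/some-news-article"

article = Article(url)
article.download()   # fetch the raw HTML
article.parse()      # extract title, authors, text, images, videos

print(article.authors)       # author names
print(article.publish_date)  # publication date
print(article.top_image)     # URL of the major image in the article
print(article.movies)        # videos present in the article

article.nlp()                # run keyword/summary extraction
print(article.keywords)      # key words describing the article
print(article.summary)       # summary of the article
```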


In the last few years, we have seen a great shift in technology, with projects moving towards a 'microservice architecture' and away from the old 'monolithic architecture'. This approach has done wonders for us.

As the saying goes, 'smaller things are much easier to handle', and microservices can indeed be handled conveniently. But different microservices need to interact with each other. I handled this using HTTP API calls, which seemed great and worked for me.

But is this the perfect way to do things?

The answer is a resounding 'no', because we compromised both speed and efficiency there.

Then the gRPC framework came into the picture, and it has been a game-changer.

What is gRPC?

Quoting the official documentation:
“gRPC or Google Remote Procedure Call is a modern open-source high-performance RPC framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication.”


Credit: gRPC

RPCs, or remote procedure calls, are messages that a client sends to a remote system to get a task (or subroutine) executed there, as if it were a local call.

Google’s RPC is designed to facilitate smooth and efficient communication between the services. It can be utilized in different ways, such as:

  • Efficiently connecting polyglot services in microservices style architecture
  • Connecting mobile devices, browser clients to backend services
  • Generating efficient client libraries
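Before looking at gRPC specifically, the bare RPC idea can be made concrete with Python's standard-library xmlrpc modules: the client invokes what looks like a local function, but the work actually happens in the server process. (This is plain XML-RPC, not gRPC; it is only here to illustrate the concept.)

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# A procedure that will run in the "remote" server process.
def add(a, b):
    return a + b

# Bind to port 0 so the OS picks a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(add, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client calls add() as if it were local; the call travels over HTTP.
client = ServerProxy(f"http://127.0.0.1:{port}")
result = client.add(2, 3)
print(result)  # -> 5

server.shutdown()
```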

Why gRPC?

- HTTP/2-based transport - It uses the HTTP/2 protocol instead of HTTP/1.1, which brings multiple benefits. One major benefit is multiplexing: multiple bidirectional streams can be created and carried in parallel over a single TCP connection, making it swift.

- Auth, tracing, load balancing and health checking - gRPC provides all these features, making it a secure and reliable option to choose.

- Language independent communication- Two services may be written in different languages, say Python and Golang. gRPC ensures smooth communication between them.

- Use of Protocol Buffers - gRPC uses protocol buffers as its Interface Definition Language (IDL) for defining the structure of the data sent between the gRPC client and the gRPC server, and also as its message interchange format.

Let's dig a little more into what Protocol Buffers are.

Protocol Buffers

Protocol Buffers, like XML, are an efficient and automated mechanism for serializing structured data. They provide a way to define the structure of the data to be transmitted. Google says that protocol buffers are better than XML, as they are:

  • simpler
  • three to ten times smaller
  • 20 to 100 times faster
  • less ambiguous
  • able to generate data access classes that are easier to use programmatically

Protobuf messages and services are defined in .proto files, and they are easy to define.
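For example, a hypothetical message type might be declared like this (the message and field names are illustrative, not from this article):

```protobuf
syntax = "proto3";

// A hypothetical request message: each field gets a type and a unique tag.
message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 results_per_page = 3;
}
```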

Types of gRPC implementation

1. Unary RPCs:- This is the simplest kind of gRPC, and it works like a normal function call. The client sends a single request, declared in the .proto file, to the server and gets back a single response from the server.

CODE: https://gist.github.com/velotiotech/d2938c90ee7948186e7a3848f3558577.js

2. Server streaming RPCs:- The client sends a message declared in the .proto file to the server and gets back a stream of messages to read. The client reads from that stream until there are no more messages.

CODE: https://gist.github.com/velotiotech/0bdb7a50673c97745b37995a83f74ba3.js

3. Client streaming RPCs:- The client writes a sequence of messages using a write stream and sends them to the server. After all the messages are sent, the client waits for the server to read them and return a response.

CODE: https://gist.github.com/velotiotech/757cef3a558b6ffbd38ff6eee37ab8ab.js

4. Bidirectional streaming RPCs:- Both gRPC client and the gRPC server use a read-write stream to send a message sequence. Both operate independently, so gRPC clients and gRPC servers can write and read in any order they like, i.e. the server can read a message then write a message alternatively, wait to receive all messages then write its responses, or perform reads and writes in any other combination.

CODE: https://gist.github.com/velotiotech/3e64bbe6b9e15c13feb31b2204f27ec0.js

Note: gRPC guarantees the ordering of messages within an individual RPC call. In the case of bidirectional streaming, the order of messages is preserved within each stream.

Implementing gRPC in Python

Currently, gRPC provides support for many languages, like Golang, C++, and Java. I will be focusing on its implementation in Python.

CODE: https://gist.github.com/velotiotech/bb3daedb9e213985122dde02190653ac.js

This will install all the required dependencies to implement gRPC.
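The gist above is not visible in this copy; per the grpc.io Python quick start (not this article), the usual installation is:

```shell
python3 -m pip install grpcio grpcio-tools
```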

Unary gRPC

For implementing gRPC services, we need to define three files:-

  • Proto file - The proto file comprises the declaration of the service and is used to generate the stubs (<package_name>_pb2.py and <package_name>_pb2_grpc.py). These are used by the gRPC client and the gRPC server.
  • gRPC client - The client makes a gRPC call to the server to get the response as per the proto file.
  • gRPC Server - The server is responsible for handling requests from the client and returning responses.

CODE: https://gist.github.com/velotiotech/28d88d9bbf29c86e0f548cb73eeaa965.js

In the above code, we have declared a service named Unary. A service consists of a collection of RPC methods; for now, I have implemented a single method, GetServerResponse(). This method takes an input of type Message and returns a MessageResponse. Below the service declaration, I have declared Message and MessageResponse.
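The gist itself is not shown in this copy; based on the description above and on the sample output later in the article, the proto file plausibly looks something like this (the field names are an informed guess, not the gist's exact contents):

```protobuf
syntax = "proto3";

// Service with a single unary RPC method.
service Unary {
  rpc GetServerResponse (Message) returns (MessageResponse) {}
}

message Message {
  string message = 1;
}

// The sample output later shows a text reply plus a "received" flag.
message MessageResponse {
  string message = 1;
  bool received = 2;
}
```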

Once we are done with the creation of the .proto file, we need to generate the stubs. For that, we will execute the below command:-

CODE: https://gist.github.com/velotiotech/bc5fbd828ba23019161c8fd25566f1da.js

Two files are generated named unary_pb2.py and unary_pb2_grpc.py. Using these two stub files, we will implement the gRPC server and the client.
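Assuming the proto file is named unary.proto (consistent with the generated unary_pb2 names), the grpc_tools invocation is typically:

```shell
python3 -m grpc_tools.protoc --proto_path=. --python_out=. --grpc_python_out=. unary.proto
```

This emits unary_pb2.py (the message classes) and unary_pb2_grpc.py (the client stub and the servicer base class).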


Implementing the Server

CODE: https://gist.github.com/velotiotech/3e6812a7277cc765dde2e4c77a707a67.js

In the gRPC server file, there is a GetServerResponse() method which takes `Message` from the client and returns a `MessageResponse` as defined in the proto file.

The server() function is called from the main function and keeps the server listening at all times. We will run the unary_server to start the server:


CODE: https://gist.github.com/velotiotech/8d067e1d1ae747b03121255492bde7af.js

Implementing the Client

CODE: https://gist.github.com/velotiotech/75f6f2f53e722db2a7343c03782a74aa.js

In the __init__ function, we have initialized the stub using `self.stub = pb2_grpc.UnaryStub(self.channel)`. We also have a get_url function, which calls the server using the stub initialized above.

This completes the implementation of Unary gRPC service.
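The full server and client live in the gists above. As a self-contained illustration of the same unary round trip, gRPC's low-level generic (bytes-in/bytes-out) API can be used without any generated stubs; this is a sketch for illustration, not the article's code, and the service/method names merely mirror it:

```python
from concurrent import futures

import grpc

# Server-side behavior for the unary method: raw bytes in, raw bytes out.
def get_server_response(request, context):
    return b"Hello I am up and running. Received: " + request

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
handler = grpc.method_handlers_generic_handler(
    "Unary",
    {"GetServerResponse": grpc.unary_unary_rpc_method_handler(get_server_response)},
)
server.add_generic_rpc_handlers((handler,))
port = server.add_insecure_port("127.0.0.1:0")  # port 0 = pick a free port
server.start()

# Client: with no serializers registered, bytes pass straight through.
channel = grpc.insecure_channel(f"127.0.0.1:{port}")
call = channel.unary_unary("/Unary/GetServerResponse")
reply = call(b"Hello Server you there?")
print(reply)

channel.close()
server.stop(grace=None)
```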

Let's check the output:-

Run -> python3 unary_client.py

Output:-

Python web scraping tools

message: 'Hello Server you there?'

message: 'Hello I am up and running. Received ‘Hello Server you there?’ message from you'

received: true

Bidirectional Implementation

CODE: https://gist.github.com/velotiotech/bbabd8c23f18d1da0c480339de226eb7.js

In the above code, we have declared a service named Bidirectional. A service consists of a collection of RPC methods; for now, I have implemented a single method, GetServerResponse(). This method takes a stream of Message values as input and returns a stream of Message values. Below the service declaration, I have declared Message.
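Based on that description, the proto file plausibly looks like this (a sketch, not the gist's exact contents):

```protobuf
syntax = "proto3";

// Both sides stream Message values independently of each other.
service Bidirectional {
  rpc GetServerResponse (stream Message) returns (stream Message) {}
}

message Message {
  string message = 1;
}
```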

Once we are done with the creation of the .proto file, we need to generate the stubs. To generate them, we need to execute the below command:-

CODE: https://gist.github.com/velotiotech/b33906ac7adb8a51311b58f952ff8cd8.js

Two files are generated named bidirectional_pb2.py and bidirectional_pb2_grpc.py. Using these two stub files, we will implement the gRPC server and client.

Implementing the Server

CODE: https://gist.github.com/velotiotech/81b63c1a92f23b9c4478d09433a2f281.js

In the gRPC server file, there is a GetServerResponse() method which takes a stream of `Message` from the client and returns a stream of `Message`, each independent of the other. The server() function is called from the main function and keeps the server listening at all times.

We will run the bidirectional_server to start the server:

CODE: https://gist.github.com/velotiotech/11e327c95e9fed1fb1be84357ee0566a.js

Implementing the Client

CODE: https://gist.github.com/velotiotech/ad7b026cad3b523de876cf131d52d4d2.js

In the run() function, we have initialized the stub using `stub = bidirectional_pb2_grpc.BidirectionalStub(channel)`.

And we have a send_message function to which the stub is passed; it makes multiple calls to the server and receives the results from the server simultaneously.

This completes the implementation of Bidirectional gRPC service.
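As with the unary case, the streaming round trip can be sketched self-contained with gRPC's generic bytes API (again an illustration mirroring, not reproducing, the article's code):

```python
from concurrent import futures

import grpc

# Server behavior: read the client's stream and yield one reply per message.
def get_server_response(request_iterator, context):
    for request in request_iterator:
        yield b"Hello from the server, received: " + request

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
handler = grpc.method_handlers_generic_handler(
    "Bidirectional",
    {"GetServerResponse": grpc.stream_stream_rpc_method_handler(get_server_response)},
)
server.add_generic_rpc_handlers((handler,))
port = server.add_insecure_port("127.0.0.1:0")
server.start()

channel = grpc.insecure_channel(f"127.0.0.1:{port}")
call = channel.stream_stream("/Bidirectional/GetServerResponse")

# The client streams requests while reading the server's reply stream.
messages = [b"First message", b"Second message", b"Third message"]
replies = [reply for reply in call(iter(messages))]
for reply in replies:
    print(reply)

channel.close()
server.stop(grace=None)
```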

Let's check the output:-

Run -> python3 bidirectional_client.py

Output:-

Hello Server Sending you the First message

Hello Server Sending you the Second message

Hello Server Sending you the Third message

Hello Server Sending you the Fourth message

Hello Server Sending you the Fifth message

Hello from the server received your First message

Hello from the server received your Second message

Hello from the server received your Third message

Hello from the server received your Fourth message

Hello from the server received your Fifth message

For code reference, please visit here.

Conclusion

gRPC is an emerging RPC framework that makes communication between microservices smooth and efficient. I believe gRPC is currently confined largely to inter-microservice communication, but it has many other uses that we will see in the coming years. To know more about modern data communication solutions, check out this blog.