Introduction:
This tutorial will teach you how to make a web scraper in C# on the .NET Framework.
Theory:
Here are the steps we will follow:
Get webpage source
Dissect source
Output results
Getting the Source:
First we need to get the web page source. Our target URL is going to be the home page of sourcecodester.com. We create a basic HttpWebRequest to the site, receive the response, and read it into a string, which we return to the caller of the function...
    // Requires: using System.IO; using System.Net;
    static string getSource()
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.sourcecodester.com/");
        req.UserAgent = "curl"; // this simulates the curl Linux command
        req.Method = "GET";
        // Dispose the response and the reader once the body has been read.
        using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
        using (StreamReader reader = new StreamReader(res.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
Dissecting the Source:
Now that we have the source, we want to dissect it. As a side note, here is what the main function, where we call everything from, looks like...
    static void Main(string[] args)
    {
        string src = getSource();
    }
First we want to look for patterns in the source. You can either save the webpage from your browser and open the saved document in a text editor on your PC, or you can use a file stream to save the HTTP response from our program.
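For the second option, here is a minimal sketch of dumping the downloaded source to disk with a file stream. The class name SourceDump, the method name Save and the default file name source.html are made up for this example; they are not part of the original program.

```csharp
using System.IO;

static class SourceDump
{
    // Writes the downloaded HTML to a file so it can be opened in a text editor.
    // "source.html" is a hypothetical default; pick whatever path suits you.
    public static string Save(string src, string path = "source.html")
    {
        // FileMode.Create overwrites the file if it already exists.
        using (FileStream fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        using (StreamWriter writer = new StreamWriter(fs))
        {
            writer.Write(src);
        }
        return Path.GetFullPath(path); // tells you where the file ended up
    }
}
```

Calling SourceDump.Save(src) from Main right after getSource() should leave a copy of the page next to the executable.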
Looking at the source, we can see that all the articles are surrounded by divs with the class of ''. About three classes into the div's class list, we can see that the one I have selected is a 'node-book'. There are other types, such as 'source-code', so we are going to use only the classes that are shared by all the articles.
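The shared class name is not given in the text above, so the following is only a sketch of one way to collect the containers, not necessarily the approach the original program took. It takes the class name as a parameter, finds each opening div that carries it, and then counts nested div tags to locate the matching close. ArticleExtractor and GetContainers are names invented for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ArticleExtractor
{
    // Returns every <div>...</div> block whose class attribute contains className.
    // className is a placeholder: pass the class you found in the page source.
    public static List<string> GetContainers(string src, string className)
    {
        var articles = new List<string>();
        // Opening div whose class list contains className as a whole token,
        // so "node" would not match inside "node-book".
        var open = new Regex(
            "<div\\b[^>]*\\bclass\\s*=\\s*\"[^\"]*(?<![\\w-])" + Regex.Escape(className) +
            "(?![\\w-])[^\"]*\"[^>]*>",
            RegexOptions.IgnoreCase);
        // Any opening or closing div tag, used to track nesting depth.
        var anyDiv = new Regex("<(/?)div\\b[^>]*>", RegexOptions.IgnoreCase);

        int pos = 0;
        while (pos < src.Length)
        {
            Match start = open.Match(src, pos);
            if (!start.Success) break;

            int depth = 0;
            int end = -1;
            // The first tag seen here is the container's own opening div.
            for (Match m = anyDiv.Match(src, start.Index); m.Success; m = m.NextMatch())
            {
                depth += m.Groups[1].Value == "/" ? -1 : 1;
                if (depth == 0) { end = m.Index + m.Length; break; }
            }
            if (end < 0) break; // unbalanced markup: stop rather than guess

            articles.Add(src.Substring(start.Index, end - start.Index));
            pos = end; // continue searching after this container
        }
        return articles;
    }
}
```

In Main, something like List<string> articles = ArticleExtractor.GetContainers(src, className); would then fill the list. Regular expressions are fragile on real-world HTML, so a proper HTML parser is the safer choice for anything beyond a demonstration.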
Outputting the Results:
All done. Now we can simply output the resulting containers...
    foreach (string s in articles)
    {
        Console.WriteLine(s);
    }
Of course, this was just a simple demonstration. We could then dissect the information further and extract the titles and other pieces of information from the divs.
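As a rough idea of what that further dissection could look like, here is a sketch that pulls a title out of one container. It assumes the title is the text of the first link inside the div, which is a guess about the page layout and should be checked against the real source. TitleExtractor and GetTitle are hypothetical names.

```csharp
using System.Net;
using System.Text.RegularExpressions;

static class TitleExtractor
{
    // Assumption: the first <a> element inside a container holds the article title.
    static readonly Regex FirstLink = new Regex(
        "<a\\b[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Matches any tag, so nested markup inside the link can be dropped.
    static readonly Regex Tags = new Regex("<[^>]+>");

    // Returns the link text with tags removed and entities decoded,
    // or null when the container has no link at all.
    public static string GetTitle(string container)
    {
        Match m = FirstLink.Match(container);
        if (!m.Success) return null;
        string text = Tags.Replace(m.Groups[1].Value, "");
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

Inside the output loop, Console.WriteLine(TitleExtractor.GetTitle(s)) would then print a title per container instead of the raw HTML.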
Finished!