Creating a Web Page Scraper in C#

bulletproof

Introduction:
This tutorial will teach you how to make a web scraper in C# on the .NET Framework.

Theory:
Here are the steps we will follow:
Get the web page source
Dissect the source
Output the results

Getting the Source:
First we need to get the web page source. Our target URL is the home page of sourcecodester.com. We create a basic HttpWebRequest to the site, receive the response, and read it into a string, which the function returns to its caller:

    // Requires: using System.IO; using System.Net;
    static string getSource()
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.sourcecodester.com/");
        req.UserAgent = "curl"; // send the same User-Agent string as the curl command-line tool
        req.Method = "GET";
        // Dispose the response and the reader once the body has been read.
        using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
        using (StreamReader reader = new StreamReader(res.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

Dissecting the Source:
Now that we have the source, we want to dissect it. As a side note, here is the Main function from which everything is called:

    static void Main(string[] args)
    {
        string src = getSource();
    }

First we want to look for patterns in the source. You can either save the web page from your browser and open the saved document in a text editor on your PC, or you can use a file stream to save the HTTP response from our program.
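
As a minimal sketch of the second option, the string returned by getSource() could be written to disk and then opened in a text editor. The names SourceSaver and SaveSource, the file name, and the placeholder HTML are all assumptions made for illustration; File.WriteAllText is used here in place of an explicit file stream because it opens, writes, and closes the file in one call.

```csharp
using System;
using System.IO;

public static class SourceSaver
{
    // Writes the page source to the given path and returns the full path.
    // SaveSource is a hypothetical helper, not part of the original tutorial.
    public static string SaveSource(string src, string path)
    {
        File.WriteAllText(path, src); // creates the file, or overwrites an existing one
        return Path.GetFullPath(path);
    }

    public static void Main()
    {
        // Placeholder content; in the tutorial this would be the result of getSource().
        string src = "<html><body>placeholder</body></html>";
        string path = Path.Combine(Path.GetTempPath(), "page-source.html");
        Console.WriteLine("Saved to " + SaveSource(src, path));
    }
}
```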

Looking at the source, we can see that all the articles are surrounded by divs with the class of ''. About three classes into the div's class list, the one I have selected is 'node-book'; there are other types, such as 'source-code', so we are going to use only the classes that are shared by all the articles.
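
The step that collects the article containers is not shown here, so the following is only a minimal sketch of one way it could be done, not the original code. It assumes the containers are div elements that share a class name; the class name "node", the sample HTML, and the names ArticleExtractor and ExtractArticles are hypothetical placeholders and are not taken from the site. Regular expressions are used for brevity; a real HTML parser would be more robust against unusual markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ArticleExtractor
{
    // Returns every top-level <div> whose class attribute contains className
    // as a whole class name, including any nested divs inside it.
    public static List<string> ExtractArticles(string src, string className)
    {
        var articles = new List<string>();
        // Opening div tag whose double-quoted class list contains className exactly
        // (so "node" does not match "node-book").
        var openTag = new Regex(
            @"<div\b[^>]*\bclass\s*=\s*""[^""]*(?<![\w-])" + Regex.Escape(className) +
            @"(?![\w-])[^""]*""[^>]*>",
            RegexOptions.IgnoreCase);
        // Any div opening or closing tag, used to track nesting depth.
        var divToken = new Regex(@"<div\b|</div\s*>", RegexOptions.IgnoreCase);

        int pos = 0;
        while (pos < src.Length)
        {
            Match start = openTag.Match(src, pos);
            if (!start.Success) break;

            int depth = 0;
            int end = -1;
            for (Match t = divToken.Match(src, start.Index); t.Success; t = t.NextMatch())
            {
                depth += t.Value.StartsWith("</", StringComparison.Ordinal) ? -1 : 1;
                if (depth == 0) { end = t.Index + t.Length; break; }
            }
            if (end < 0) break; // unbalanced markup: stop rather than guess

            articles.Add(src.Substring(start.Index, end - start.Index));
            pos = end; // continue searching after this container
        }
        return articles;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the real page source.
        string src =
            "<div class=\"node node-book\"><div class=\"inner\">A</div></div>" +
            "<p>x</p>" +
            "<div class=\"node source-code\">B</div>" +
            "<div class=\"other\">C</div>";

        List<string> articles = ExtractArticles(src, "node");
        foreach (string s in articles)
        {
            Console.WriteLine(s);
        }
    }
}
```

In the tutorial's Main, the equivalent call would take the string returned by getSource() and whichever class name the page actually uses.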

Outputting the Results:
With the matching containers collected in articles, we can now simply output them:

    foreach (string s in articles)
    {
        Console.WriteLine(s);
    }

Of course, this was just a simple demonstration; we could then dissect the information further and extract the titles and other pieces of information from the divs.
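
As a rough illustration of that next step, the sketch below pulls a title out of a single container string. It assumes, without confirmation from the page itself, that the title is the text of the first link inside the container; TitleExtractor, ExtractTitle, and the sample markup are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Returns the visible text of the first <a> element in the container,
    // or an empty string if the container has no link.
    public static string ExtractTitle(string container)
    {
        Match m = Regex.Match(
            container,
            @"<a\b[^>]*>(.*?)</a\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success) return "";

        // Drop any tags nested inside the link, then decode entities such as &amp;.
        string inner = Regex.Replace(m.Groups[1].Value, "<[^>]+>", "");
        return WebUtility.HtmlDecode(inner).Trim();
    }

    public static void Main()
    {
        // Hypothetical container; a real one would come from the articles collection.
        string container = "<div><h2><a href=\"/x\">Hello &amp; <b>World</b></a></h2></div>";
        Console.WriteLine(ExtractTitle(container));
    }
}
```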

Finished!
