How does changing an HTTP referrer header help circumvent crawler blocking

I have been researching different ways a web crawler might be blacklisted or blocked by a web server, and how to potentially circumvent that. One of those ways is to change the referrer header on the request. I have been looking in various places trying to figure out the benefit of doing this, but I believe I am thinking about it too hard and have tunnel vision.



A couple of other ways to disguise yourself from the web servers you are crawling are changing the User-Agent header on the request, or proxying your requests through other servers so that each call comes from a new public IP. That makes sense to me, since the server can't tell the requests are all coming from the same machine or the same client. For all it knows, the traffic is coming from potentially thousands of machines and 10-20 different browsers, all unique users. Is changing the referrer header in the request supposed to give the same benefit? I'm getting hung up on how that would be implemented. Would you just cycle through hundreds of randomly generated URLs and add a new one to the request headers each time?




For example: ref1 = www.random.com, ref2 = www.random2.com, ref3 = random3.com
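For illustration, here is roughly what I am imagining with Python's requests library (the referrer domains are made up and the rotation logic is just my guess at how this would be done):

    import random
    import requests

    # Hypothetical pool of made-up referrer URLs to rotate through
    FAKE_REFERRERS = [
        "http://www.random.com/",
        "http://www.random2.com/",
        "http://random3.com/",
    ]

    def fetch(url):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            # The header is spelled "Referer" (one r) in HTTP
            "Referer": random.choice(FAKE_REFERRERS),
        }
        return requests.get(url, headers=headers, timeout=10)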











Tags: web-crawlers referrer python request
asked 4 hours ago by JBT
1 Answer
The idea would be to make your requests look as much like a real browser as possible. Real browsers send referrer headers, so you'd want to send referrer headers that look as much as possible like the ones a real browser would send.



A real browser never sends random referrer headers. It sends a referrer header for the previous page. Most referrers then end up being pages from the same site.



The ideal strategy would be to crawl the home page without a referrer header. This mimics a user who types in the home page URL, which is very common. As your crawler views pages on the site, it would keep track not only of the URLs it finds, but also of which pages it found those URLs on. It would then use one of the pages where it found the link as the referrer when it fetches the new page.





• No referrer - Gives away that you are a bot.

• Random referrers - Gives away that you are a bot and probably pollutes the site's analytics. That type of bot is likely to be blocked even faster than a no-referrer bot.

• Home page referrer - Always using the home page as the referrer can sometimes get around checks for a missing referrer and looks somewhat legitimate.

• Linking page as referrer - The strategy I described above is most like a real browser, but even then the order in which you visit pages is likely to differ from a real visitor's.
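Here is a rough sketch of that last strategy in Python with the requests library. The starting URL, the link extraction, and the page limit are just placeholders for however your crawler actually discovers and schedules links:

    import urllib.parse
    import requests
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        """Collects href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(home_url, max_pages=50):
        # Map of URL -> the page it was found on, so we can send that page as the referrer.
        found_on = {home_url: None}   # the home page is fetched with no referrer
        queue = [home_url]
        seen = set()
        session = requests.Session()
        session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)

            headers = {}
            referrer = found_on.get(url)
            if referrer:                       # the home page gets no Referer header
                headers["Referer"] = referrer  # note the "Referer" spelling in HTTP
            response = session.get(url, headers=headers, timeout=10)

            # Record every link along with the page it was found on.
            parser = LinkParser()
            parser.feed(response.text)
            for href in parser.links:
                absolute = urllib.parse.urljoin(url, href)
                # Stay on the same site
                if urllib.parse.urlsplit(absolute).netloc != urllib.parse.urlsplit(home_url).netloc:
                    continue
                if absolute not in found_on:
                    found_on[absolute] = url
                    queue.append(absolute)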






answered 1 hour ago by Stephen Ostermiller