Webcrawlers copying my site

I recently created a site that would serve as a personal blog. I noticed that if I google my site’s name, I get an exact copy of it, hosted by someone else.

I’m guessing this is the action of a bot that does this on a large scale for phishing purposes.

That leads to the question: since this bot copies any files I host on my site and serves them itself, does that introduce any vulnerabilities on their end? How would I go about looking for such vulnerabilities?

What is a good way to defend against this sort of bot? Is setting up a honeypot and IP-banning an effective strategy?

4 Likes

robots.txt? Anti-bot tools?

2 Likes

I set up a robots.txt to look like this:
User-agent: *
Disallow: /about

I don’t mind search engines crawling my site to index it; I just don’t want it to be copied. Is there a way robots.txt can achieve that?

2 Likes

I will leave CAPTCHA as a last resort.

I’ve looked at the access logs and there are quite a few requests on there. I’m certain they are all bots; it’s a new site and I don’t expect any real visitors.
Some examples of these requests are:

  • GET /ab2g
  • “\x0Bm\xAC\x86{\xD8\xF4\xE7\x19\xA5\xA1n” → What is this even?
  • GET /.env
  • "GET /admin/phpinfo.php

I’ve added a little JavaScript snippet at the top of the site that checks whether the domain matches the expected one and brings up a message if it doesn’t. The bots copied it as-is. This is what their site looks like now: https://live.bayeq.xtuplecloud.com/
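
For anyone curious, it was along these lines (a rough sketch rather than the exact snippet; "myblog.example" is a placeholder, not the real domain):

// Rough sketch of the domain check described above;
// "myblog.example" is a placeholder for the real domain.
(function () {
  var expected = "myblog.example";
  if (window.location.hostname !== expected) {
    document.addEventListener("DOMContentLoaded", function () {
      var warning = document.createElement("div");
      warning.textContent = "This is an unauthorized copy. The original site is at https://" + expected;
      warning.style.cssText = "background:#c00;color:#fff;padding:1em;text-align:center;";
      document.body.prepend(warning);
    });
  }
})();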

I don’t want to disable right-clicking, and I’m not sure it would deter the bots anyway; their method of copying most likely doesn’t depend on right-clicking.

I will set up Mitchell Krogza’s Ultimate Bad Bot Blocker and see if that helps.

The main thing I’m curious about is whether this approach of blindly copying all the files opens them up to attack. For example, if they copy a keylogging application and are somehow tricked into running it. I can’t think of a way to execute such an attack, but I’m curious if anyone else has any ideas.

3 Likes

You could just add a password to the site with the prompt “The password is: _________”.

// https://stackoverflow.com/a/44698584/2901077
function run() {
  var password = prompt("Password is fuckbots");
  if (password != 'fuckbots') {
    document.body.innerHTML = 'Password failed! Reload to re-enter the password.';
  } else {
    alert('Success');
  }
}
run();
3 Likes

Oh, I just saw this post and wanted to suggest the Ultimate Bad Bot Blocker, but I see you already found it. I am using it in many projects with Nginx and it works great. In blacklist-ips.conf you can add the clone server’s IP, and in custom-bad-referrers.conf its domain. It works pretty well.

You can also block image hotlinking in the website’s conf, for example:

location ~ \.(gif|png|jpe?g)$ {
    valid_referers none blocked website.com *.website.com;
    if ($invalid_referer) {
        return 403;
    }
}

Sometimes this breaks the clone site (I had one situation like that), but the Ultimate Bad Bot Blocker is amazing and should be enough on its own.

Analyze your web server logs and check whether this is some kind of custom bot with a custom user agent; if it is, you can also add it to blacklist-user-agents.conf.

And from time to time, search for clones and add them to the blacklist.

Let us know if you manage it.

The default configuration protects pretty well against script kiddies, standard tools, and well-known bots.

3 Likes

I installed the Ultimate Bad Bot Blocker. I still see sketchy requests in the logs, but fewer than before, I think. It looks like installing fail2ban and setting up some rules would help; I’ll spend some time learning that next.

I like the garlic idea of encoding all the content and decoding it after the DOMContentLoaded event fires. I won’t use it for now, but I’ll keep it in mind.
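
If I ever do, a minimal sketch might look like this (assuming the content is shipped Base64-encoded in a data-encoded attribute; that attribute name and the Base64 choice are my assumptions, not part of the original suggestion):

// Minimal sketch: decode Base64-encoded content once the DOM is ready.
// The data-encoded attribute and the Base64 encoding are assumptions.
document.addEventListener("DOMContentLoaded", function () {
  document.querySelectorAll("[data-encoded]").forEach(function (el) {
    // atob() turns the Base64 string back into the original markup.
    el.innerHTML = atob(el.getAttribute("data-encoded"));
  });
});

On its own this only inconveniences scrapers that don’t execute JavaScript; combined with the domain check it gets a bit more annoying to copy.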

Thanks for the tips!

1 Like

Here are some iptables rules as well, if you want them. I’m making log parser scripts, just slowly lol.
