“Magic links” can end up in Bing search results, rendering them useless (medium.com/ryanbadger)
537 points by rsbadger on June 27, 2022 | 233 comments



I've had to deal with this with e-mail verification links and Auth0. The user clicked the link after getting it in their mailbox but then Auth0 throws up an error page because the e-mail address has already been verified (by Outlook scanning). The problem becomes worse if for some reason the mail ends up in the junk mail folder so the user thinks they've never received the mail but when you check it looks like the e-mail address was verified successfully. That has caused a lot of annoying back and forth trying to figure out what the hell is going on. We ended up adding a custom page to handle e-mail validation so we could handle the situation where the user lands on the page and the address has already been verified. Super annoying.


Links like this are stupid regardless of Outlook's behaviour because they require a perfectly reliable client and network and user in a perfectly undisturbed flow. If I can't F5, if I double-click, if my mouse is wonky, my wifi is bad, my power goes out, my computer hangs, my DSL dies just after a click, if I accidentally close the tab.. there are any of a thousand reasons why abusing GET for a one-time-use page or redirect is horribly wrong.

It takes incredible arrogance to continue using them in order to "improve usability" given all the obvious and common cases where they completely destroy usability. The difficulty for a provider to verify they aren't sending you to a phishing or browser 0day page barely scratches the surface.


The only purpose of this link was to verify that the email address is valid. Once it’s verified, you can login.


I have seen services where you have to click a link every time you want to log in


They are called magic links... only thing magic about them is their ability to annoy me


I think they exist to simplify the flow for the subset of users who end up using the Reset Password link each time their session expires.

And I think that subset is much larger than some would expect.


This. You'd be amazed how many users just do a password reset each time to log in instead of remembering their login info.


My father has insisted on doing this for over 20 years, but he doesn't know how to do it himself. I expect a password-reset phone call from him every 2 or 3 days and have done since 1998. Just recently he had someone from his bank's IT department call him directly about resetting his password over 500 times.


I'm not sure if he's still doing it but someone put together https://theuserisdrunk.com/ and https://theuserismymom.com/ a few years back... I wonder if you could do something similar here, given the level of absolute predictability that seems to be involved.

I sadly can't put my finger on what's so compelling about this, just that my "oh that person should talk to a UX team lead!" meter just went plink


Or "passwordless" login, and I love it. Not many people use password managers and will reuse passwords between websites (I.e. their bank and some random unsecured SaaS product). One-time emailed passwords are an easy way to avoid this problem and have a fairly secure site (mind you, it's only as secure as their email). You can layer 2FA on top of this too.

It's only annoying if the site is constantly timing you out so that every single visit you need to resend. Why not just use secure cookies to remember the user for say a week?


>They are called magic links... only thing magic about them is their ability to annoy me

I love them and prefer them to creating yet another account with a password.


Me too!


I had a similar case recently where I was getting the magic link in an email on my phone, and needed to copy it into Slack so I could click it on the laptop I wanted to actually log in on

This... was impossible to do, because by long-pressing on iOS to get the Copy prompt, iOS also goes ahead and opens a preview of the link next to it


Haha, I was in a restaurant where you paid through your phone. My browser updated so it closed between the thank you page and the payment click. State was lost so the thank you page was broken. The restaurant didn't think I paid but my bank account said otherwise (this was a bank transfer via iDEAL, not credit card). Getting out of there without paying twice was entertaining.


>We ended up adding a custom page to handle e-mail validation so we could handle the situation where the user lands on the page and the address has already been verified.

That's a yikes from me! So I can sign up on your service as anyone with an Outlook account, without verification?


Facebook has allowed this in the past and someone recently opened an unverified Instagram account with my address.


I'd assume the custom page has a random URL and requires entering the email address (requiring a match) or clicking a button. I've seen some account confirmation pages like that.


Discord let someone sign up with my gmail email address, sent an email verification link, and before I saw either the "welcome to Discord" or "please verify your email" links they'd already let the person in as me. I don't know if this is because of google crawling links from mail or some other kind of failure, but I wasn't pleased that Discord would let someone impersonate me.


No it would still require username/password. This was only verifying the email address was correct.


Wow! I think you just figured out an issue I had while working in a previous company using Auth0, where the authentication token would expire before the user had actually gone there (so the user saw an error page when clicking), but on our side it looked like the user went there but dropped off directly after. Had maybe 1% of the users complaining about this, but we never found the root-cause (we moved to our own authentication before we could figure it out). This has to have been why. Thanks for sharing this!


Yw. Took us ages to figure out


HTTP GET requests are supposed to be idempotent, meaning that when you call a URL twice it should not lead to any different result compared to calling it once. This is part of the HTTP standard.

So while I think what Outlook does here is wrong, what these webpages do is simply a bug that should be fixed and shows a lack of understanding of HTTP.


> shows a lack of understanding of HTTP

I think that's a bit too much. Nothing in that suggests that they are breaking anything in the HTTP specification. You're right that GET requests have to be idempotent, but the exchange of the single-use code you get in email for the API token is most likely behind a non-GET request (like POST). The HTTP server responds to GET requests with the static assets (HTML/CSS/JS), but the static assets contain JavaScript that calls the POST endpoint for the exchange.

At least that's my guess. I agree it's a bug on their side, and they should fix it. But I think it's more of a UX issue than breaking the protocol.
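Roughly what I mean, as a minimal sketch (the endpoint name and element id here are made up, not anything from the article): the GET only serves the page, and the single-use code is only consumed by a POST that the user triggers.

    // client-side TypeScript on the page served by the GET request
    const params = new URLSearchParams(window.location.search);
    const code = params.get("code"); // single-use code from the emailed link

    document.getElementById("confirm")?.addEventListener("click", async () => {
      // Only this POST consumes the code; a bot that merely GETs the page
      // (and never clicks) doesn't burn it.
      const res = await fetch("/api/verify-email", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ code }),
      });
      if (!res.ok) {
        console.error("verification failed", res.status); // e.g. code already used or expired
      }
    });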


Agree, the commenter is hung up on their demonstrably superior understanding of HTTP.

We found at least one scanning service to be fetching the URL with the user agent of a browser, and executing JavaScript on the page.

A lot of our user interactions could be simpler, but this sort of behaviour led to many things being put behind a "go" button.

I only wonder how long before scanners and search engines start clicking these buttons to activate content on the page so they can scan/index it.


It's idempotent. Your account won't get un-verified if you open the verification link twice.

Idempotent != side-effect free


GET is supposed to be safe as well as idempotent though.

Request methods are considered "safe" if their defined semantics are essentially read-only; i.e., the client does not request, and does not expect, any state change on the origin server as a result of applying a safe method to a target resource. (RFC 7231, § 4.2.1)


It's not though. Loading the page twice creates two different outcomes. Idempotent endpoints can't.

!Idempotent != Reversal of changes


What are the two different outcomes?

1. You open the link once = your account is verified

2. You open the link twice = your account is still verified

???


If it's a magic link for logging in:

1. You open the link once = you're logged in

2. You open the link twice = you're presented with an error that the magic link has already been used.


I wouldn't go around saying others don't understand http. Puts you in a very awkward position when you are wrong. Which, you are. Idempotency and safety are separate concepts related to http. But, you do you. Take the advice however you want.


IMHO it is clearly a wrong assumption on the side of any such sender. A verification link should have clear definite actions for the user receiving it:

- It's me, let me confirm my address

- I never signed up for this heap of diamonds

Whenever I (not even a bot) click or follow a link from my mailbox, by accident or on purpose, I don't expect that to validate an account for anyone else, but me, intentionally, using a password I know.


Honestly that's why I built the email verification page so the user still has to click a button on that page.


I had a customer who had some sort of software that followed the link in the email we sent (no big deal so far), and THEN would follow every link and button on that page.

We had a handy quick decline and accept button on there so they were auto declining things…

I didn’t hate email until I got into web development….


The crawler followed buttons on forms? And sent POST requests? Yikes..


Yup. It was a pain. I have no idea who wrote that and thought it was a good idea…

It was a super basic web form too. Probably the most standard HTML markup we have. Nothing unusual about it that could have triggered some sort of strange behavior.


Having decline or accept buttons may not be enough.

How about having an input field asking for the email again to double check?


GET requests are supposed to be idempotent


Microsoft does this because they're security scanning / checking all links in every Outlook email for known phishing and malware attacks. If Bing has not seen the web page before and it's not in the Bing dangerous-page index, it first needs to scan/index it to determine whether it's a phishing/malware page before returning that verdict to Outlook so the email can be flagged as dangerous.


> Microsoft does this because they're security scanning / checking all links in every Outlook email for known phishing and malware attacks

The problem with that is that the logic is broken. Microsoft cannot possibly know all phishing sites, especially for smaller things. By obfuscating the link the user can no longer verify it by themselves without clicking, but Microsoft will say it's safe. So the user is left with a false sense of security and is worse off.

It only works for huge sites (e.g. mytwitter.lol phishing for Twitter and similar), but drastically lowers the chance of less high-profile phishing being caught.


> The problem with that is that the logic is broken.

If 99.99% of phishing can be prevented this way, what problem do you have with it? Would you really catch that 0.01% that an automated system wouldn't?


You mean you don't verify calls to action via other information channels? Fairly regularly I get phishing emails that correctly spoof the crypto headers of major sites (e.g., because of a misconfigured mail service). If an email asks me to do something, it either doesn't get done or I cover my ass in as many ways as possible, no exceptions.

That isn't by itself an argument against a good automated system -- I definitely like not having to sift through most of that garbage, but catching the 0.01% should be a routine practice, not something that seems like an insurmountable burden.


Maybe you should implement a "feature" that serves a simple static HTML page <p>This webpage is safe.</p> to "bingbot" and serve the real page to everyone else.


I believe the recommended practice is to hover over the URL before clicking the link.

If you do so, in Outlook, there will be a pop-up that shows "Original URL: XXX". This allows users to make a determination for themselves whether the link is safe or not.


We got some security courses about that too. Unfortunately, Outlook replaces all of them with some Safe Links URL rewriting, so the only way left to find out if a link is scammy is clicking it.


It is in fact possible to extract a destination URL from a Safelink one without clicking it. For the full link this can be tedious, but identifying the domain can still be done quickly.


For normal URLs, I agree. But in this case you have adversarial URLs. Suppose the scammer puts some "http" and "www.google.com" in the URL parameters, after some randomly generated 8-character domain on some obscure TLD.

I don't trust myself enough to be 100% sure I can decode a URL-encoded misleading mess perfectly all the time.

They already hide URLs in the username part of the URL, like www.google.com.unholymessherethatscrollsoutoftheurlbar @ malignantdomainnotgoogle.blah


Scammy Microsoft.


Microsoft offers this as a security product. It's impossible to know all links, but known ones can be blocked to limit future issues. Other enterprise email security products scan the links and follow all the redirects as well. With those products, an incredibly small amount of time after delivery, every link in an email has been "clicked".


Cool, so we should just stop building and running almost anything in existence because it's not all-encompassing? That sounds like a suboptimal path forward.


Scanning something for malware and publishing it in search results seem like 2 completely different things to me...?


But there is nothing to indicate either in the post or in the referenced SO thread that the URLs are published to the search results. They are visited by bingbot, that much seems confirmed, but there’s no example where one of these results shows up in the public search results.


They were indexed in Bing results, I’ve shared the URL of that in this thread


Holy shit, that's bad. Do unlisted YouTube and Google Drive share links get indexed through this?


You’d assume those have proper robots.txt configuration?


I have a disallow all robots.txt for a production system. Have had from the beginning.

Bing indexes it. This is my first major security incident and I have no idea how to fix this without making everything totally shitty for the users.


Some services ignore global disallow, but will respect rules explicitly targeted at them.


I've put in a hard block for all crawlers on all pages. Works for my scenario I think. Hopefully they don't lie in their user agent. Then it's going to be really bad.


Yes, but there is no indication they are publishing it in the search results.

The original post is just complaining that the malware scanning is visiting the links.

They come to the following conclusion

>This effectively makes all one-time use links like login/pass-reset/etc useless.

Which we all know is not true because sites like onetimesecret.com allow for entering a separate password to prevent this sort of thing when it does happen.

It would be an interesting discussion to talk about what Microsoft's whitelisting process looks like, but the original article doesn't seem to understand what is going on well enough to drive the conversation in that direction.


They are publishing them - it has bitten us (e.g. expired one click links for customers ending up on Bing from their emails)


I think this is where we use meta tags.

All pages with one-click links should have a robots meta tag of "noindex, follow" or "noindex, nofollow". Your SEO consultant (if you have one) should have advised you on this.

I am not saying this excuses the privacy violation but just suggesting there are things we can do...


Bing appears to be ignoring those headers for links crawled from emails


Worse would be links that are private to the people who possess the URL. Like a private video on YouTube or a private document in Google Docs. The security depends on the URL being secret. This would silently publish secret information.


If those pages have no proper meta tags or robots.txt, there's absolutely nothing wrong with this. Security by obscurity was never a good approach; from proxies to security scanners, there has always been software that crawls unassuming URLs and publishes the results somewhere, if only a report to the admin.


robots.txt disallow is ignored for my production site at least. This is super bad.


Same for us - we have robots.txt disallow etc. and the relevant headers for personal customer links and Bing is ignoring and publishing all the same


If you can say for certain that the links being published are coming from the malware scanning, and not being taken from users' browser sessions that are using Microsoft Edge, you should elaborate on this.


I would be pretty mortified if browsers were using user browser sessions to scan content and pass it to bingbot…? What about if you’re browsing something local? Or your bank account?


I would be too.

The point I was making is that someone should research this instead of relying on wild speculation as the basis for the conversation.


That would be even worse.


Nobody is saying it isn't.

It's about trying to get to the core of the issue, not just the random speculation going on in the article and in this comment thread.


It is common for corporate email security appliances as well. URLs should not be used for authentication, and neither should email. I really want to pick the brains of people that work on these types of systems to see why they don't think so.


Many people (most?) prefer to signup to services by email address. To do so, those email addresses must be verified. How would you verify it without sending them an email link?


You can verify the validity of an email like that, no issue there. Just don't use it as an authentication factor. Control over an email account should not trump passwords (what you know) or proper 2FA (what you have; email can technically be a 2FA factor like SMS, and like SMS it is not a good choice). If a person proves they control an email account then you ask them for additional info like secret questions or other information configured during registration.

I should not be able to take over your life because I compromised your phone which has sms, TOTP app and email.


a confirmation code?

Also, mail might not live on the same computer.


It doesn't matter if it's on the same computer. Sometimes all you need to do is click the link, not do anything on the page.


options include:

* use an interstitial page so that the actual activation is a POST request;

* send a confirmation code instead of a link


Is it ok to do a password reset through email? Because once you can do that you basically have email based authentication.

The password only makes this authentication less secure and it's not needed.


It is not. You can initiate a password reset via email, but additional recovery controls like security questions should still be required. In an ideal world you have 2FA as well; if you reset that via email too then it isn't actually 2FA, it is email-based 1FA with extra steps. If your 2FA has a separate mechanism for recovery as well, that would be ideal. If it was my webapp, I would use hashes of 3 answers to user-chosen questions, hashed in the browser/client. It could be an object pairing as well: every user gets a list of 30 objects or so and they pick 3 pairs as a recovery combination.


Hashing an answer to many questions doesn't prevent someone from guessing the input until the hashes match. So why bother hashing?

Recovery codes exist and are created at the time before recovery is necessary. But most people are going to lose their codes.

Who is going to remember which 3 things they picked out of 30, years ago?


If this were security scanning, why does it identify itself as BingBot? Doesn't that just allow cloaking and offer an easy workaround for any adversary with a modicum of intelligence?


I just love it when they "scan" password reset links.


The HTTP GET method is idempotent by specification. Visiting a webpage should not trigger password resets or any other actions by itself. If that's a problem then it's the site's fault for being defective.


You're right that it was a bit of an oversight on my part; as the links were only generated after a verified human user action (signup), I had assumed the one-time links sent to their email would be safe. But regardless of the link action, it shouldn't be passing that data to Bingbot to crawl and (possibly) index in search engine results. Private email data should not be shared with search engine crawlers IMO.


So how do you implement a "one click unsubscribe" link in an email? They're on GET requests. You could use JavaScript on the resulting page to then trigger the unsubscribe but bots are now running JavaScript as well.


You show a webpage with an "Unsubscribe" button in it. The button triggers a POST request.

There's also RFC 8058: https://datatracker.ietf.org/doc/html/rfc8058
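For reference, the RFC 8058 flow is driven by mail headers rather than a bare link, roughly like this (the URL is illustrative):

    List-Unsubscribe: <https://example.com/unsubscribe/opaque-token>
    List-Unsubscribe-Post: List-Unsubscribe=One-Click

A compliant mail client then sends an HTTP POST (with the body "List-Unsubscribe=One-Click") to that URI when the user hits its unsubscribe button, so a GET on the link never has to carry the side effect; humans following the link in a browser can still get a page with a confirmation button.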


That should send one to a page with a confirmation button...


Scanning with a Bing user agent? That's not a good idea.


Do they guarantee anywhere they’re not collecting this data to build profiles or do other analysis?


Wouldn't it be trivial to keep the list of malicious pages locally and not send any data?


You mean, push bing's entire list of malicious websites to every client? I doubt they want to or can do that.

And also use the local client to scan unknown links? They probably don't want outsiders to have access to this code.


Plus, how do you keep it from going stale?


If I were designing a system like this, I would not trust clients to perform legitimate analysis nor report legitimate results.


… but it doesn’t matter if the client is compromised, because all it would hurt is the user, right? If the client was compromised, it could just not send anything to your servers, or ignore the results, or …


What? But you're the one writing the client.


Doesn't matter. Never trust the client - it's outside of your control, it can be patched, it can be hacked, it can be spoofed, etc.


Little understanding: Undying trust of the client

Dunning-Kruger level of understanding: Never trust the client for anything ever, it's unreliable, everything must be off client.

Never mind that the client is literally the interface into your system, so it being compromised is already game over for an application where the user is the most vulnerable party you wanted to protect...

Deep understanding: Trusting the client requires a well thought out security model.

If the client is hacked in this case, they already have full control over what the user sees, they can cut out your remote check.

Maybe a good balance would be to hash the root of the URLs and compare those, or use fuzzy hashing on page contents, just so that the backend isn't getting a bunch of private urls that might accidentally get logged somewhere.

Trades detecting stuff hidden behind redirects for less liability on your backend, something to possibly consider depending on functional requirements.
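A rough sketch of that "hash the root of the URL" idea, assuming a Node/TypeScript client and a backend reputation lookup keyed by digests (all names here are hypothetical):

    import { createHash } from "crypto";

    // Only a digest of the origin is reported, so full private URLs
    // never reach the backend or its logs.
    function originDigest(rawUrl: string): string {
      const origin = new URL(rawUrl).origin; // e.g. "https://example.com"
      return createHash("sha256").update(origin).digest("hex");
    }

    // e.g. send originDigest(linkFromEmail) to the reputation service
    // instead of the link itself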


It sounds like you’re advocating for no client at all


Just as a trivial example, how confident would you be in this auth scheme?

1. User opens Outlook and types in their email and password.

2. The app requests the user's password hash from the server and checks it.

3. Outlook tells the server auth was successful and gets a session token.


In this example you're right. For something like scanning a site for malicious content, on-device is not a bad approach. It decreases the amount of data sent to the server.

The client has a much bigger issue to worry about if the client-side malware scanning has been compromised. Malware could modify the UI/network calls such that your server-side scanning displays a positive result anyway.

You have to trust the client to display information to the user at some point. Link malware scanning is a job that can safely be delegated to the client. Authentication cannot.


My first guess is that giving phishers/scammers the list of all malicious domains/pages might allow them to circumvent it.


Does Gmail do this?


They definitely do link re-writing. As to what use they make of the original href attributes, I don't know.


I have observed this, but also found that BingBot modifies the query string parameters of your URL. It does this by changing a character of the URL, possibly in an attempt to find new pages?

I noticed this because I generate links with a signed token to ensure integrity and started receiving invalid token crash reports in Sentry, always from BingBot...

To fix this I had to move the tokens from the query string into the URL path itself to avoid BingBot changing it, e.g.

http://mysite.io/do-action?token=shvgaaehr2rnyxhh-391-1 to http://mysite.io/do-action/shvgaaehr2rnyxhh-391-1/

Anyone else noticed this?


I would guess that this is probably done on purpose to avoid tripping one-time-use links. Seems like a good way to hide malware from the scanner though.


I've finally found someone else who's seen this behaviour!

I've noticed this too, and I found (in my case anyway) that Bing/Outlook seems to Rot13 the keys of the query parameters - is this what you're seeing too?


I imagine one could try to use the location hash. It isn't sent with the request.


Outlook will only send GET requests, which are idempotent unless you're ignoring the spec. A message saying "this code has already been used" after sending a GET request is a bug.

I don't see the problem here, all services need to do is add a page that says "welcome back, $Username, click here to log in!" and sends a POST request to do any serious confirmation without breaking any specifications.

Microsoft claims the visiting bot is BingBot but it's probably just the SmartScreen system checking for malicious links/downloads/etc. like many cloud-integrated security products do these days.

I can set my browser to pretend I'm BingBot, you can't derive anything meaningful from the user agent. Unless you find your secret URLs in Bing's search results, your secret links aren't actually being monitored by a search engine.


I’m fairly certain they are. My links ended up indexed in Bing search results. The only place they were ever rendered was in private emails to users. Bing should not be indexing that.


You're right, it shouldn't. It's possible that they're fetching these URLs from their customers' browsing history and submitting those (external submissions follow different crawling rules, sometimes bypassing robots.txt). Bing's webmaster information says so, at least: https://www.bing.com/webmasters/help/webmasters-guidelines-3...

For a bit of added "fun", Google will do the same, but if you add a page to robots.txt and set noindex then they won't process the noindex directive and external indexing sources might still generate search results: https://developers.google.com/search/docs/advanced/crawling/...


That's a bit of a narrow view on this problem. When sending a link to someone, you expect that someone to view the link. Not some random mail service. Who gave the mail server permission to access the page? What if it contains copyrighted material? What if it's one of the millions of pages which don't follow the HTTP design philosophy to the letter?

This is a can of worms.


> Who gave the mail server permission to access the page?

The recipient of the email, or their employer's IT department that is paying another company for mail services.

If you send me an email with a link then I do believe I have the right to send that link to a third party service that can validate that it's not malicious. If I decide to sign up for a mail service that promises to protect me from phishing emails, then I [0] expect said service to read the emails I receive and examine the links within them. I would be upset if the service used the info I share with them for purposes other than keeping me safe, though.

I readily admit that I have, at various points in my life, signed up for services without reading the entire TOS that I agreed to. I try to choose companies that I feel I can trust to not abuse me too much, and sometimes I avoid certain services because I don't trust the company behind them enough to respect my privacy.

[0] I acknowledge that not everyone is as knowledgeable as me, and many people might not realize that this is how the protection works. So if the argument is more education, I'm in favor.


> When sending a link to someone, you expect that someone to view the link

That sounds like a narrow view of email. This has never ever been true. Corporate firewalls have always opened links, and many users use tracking blockers in their email provider that automatically open incoming emails and detect tracking cookies. I cannot stress strongly enough that you cannot rely on only one "user" clicking a link.


What if an email client implements a prefetch functionality like browsers do? You can’t expect such requests to be user-triggered.


When sending a link to someone, you expect your antivirus, your email provider, your email provider's spam filter, any intermediate email providers, the recipient email provider's spam filter, the recipient's email provider, the recipient's antivirus, your recipient's mail client, your recipient, and any other people who the message will be forwarded to, to see the link and evaluate it. The email standard is pretty clear that any number of intermediate servers and services can and will be able to see what you're sending.

Email isn't WhatsApp, there are probably at least three or four parties who will scan the link in any way they like. If you control your side you can make sure there are only one or two parties scanning the email en route, but the number can never be guaranteed to be zero without workarounds.

Who gave the mail server permission to access the page? The person who set up email on the domain. If you don't trust the hostmaster, don't send email to that domain. What if it contains copyrighted materials? Well, you just shared a plaintext link with a whole bunch of people, depending on your local legislation you may be in trouble.

You can't even expect a link clicked once by a single user in a browser to only appear once on the server side. TCP connections get dropped and retried. This isn't some kind of philosophical interpretation of a mystical protocol spec, this happens in real life. If you use POST/PUT/whatever requests, the user agent will prompt the user if they really want to repeat a request; this protection has been built in for years. It's just how browsers work and how they've been working for decades.

If your recipient is behind a proxy, the link may be visited several times each hour for up to a month while the proxy refreshes its cache. This was a more prevalent problem back in the day, these days web proxies are mostly a thing of the past; however, proxies still exist, and if you don't pay attention to those things they will bite you in the ass.

In real life bugs happen. That's fine in these cases, web dev isn't exactly rocket science, bugs are tolerated and can be fixed. The bug here isn't the fact that links get visited twice, though: the bug here is that the developers who set up their magical links forgot about idempotency when they wrote their code, or they chose to ignore the problem because they never ran into it themselves. Either way, the responsibility to get it fixed isn't on anyone but the party violating the spec.

As a workaround, S/MIME or PGP should work around most of these problems as intermediate servers can't see what's going on. What the client's machine will do with the decrypted message is still up to interpretation, of course.


> Outlook will only send GET requests, which are idempotent unless you're ignoring the spec. A message saying "this code has already been used" after sending a GET request is a bug.

Alright, so imagine this: we have two endpoints, GET "/page" and POST "/increment". Making a POST request to "/increment" increments a counter kept in memory and returns its new value. The GET "/page" endpoint returns an HTML file, which contains JavaScript code that, when executed, calls the "/increment" endpoint.

Are we now breaking the HTTP specification's requirement that GET requests be idempotent if we visit "/page" in our browser? I think not, but this is sometimes how pages are implemented, which robots are gonna have to deal with, as otherwise many would consider them broken.

Don't get me wrong, I think it's a shitty implementation as well. But is it breaking the HTTP specification? Unlikely.
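For concreteness, a minimal sketch of that hypothetical setup (Express, purely illustrative):

    import express from "express";

    const app = express();
    let counter = 0;

    // Idempotent: always returns the same HTML.
    app.get("/page", (_req, res) => {
      res.type("html").send(`
        <script>
          fetch("/increment", { method: "POST" })
            .then(r => r.json())
            .then(d => { document.body.textContent = "count: " + d.value; });
        </script>`);
    });

    // The state change lives here, behind POST.
    app.post("/increment", (_req, res) => {
      counter += 1;
      res.json({ value: counter });
    });

    app.listen(3000);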


I don't think it's breaking the spec per se, but web crawlers execute javascript and people hit reload on their browsers, sometimes accidentally. Automating this process may not be the solution here.

Personally, I think web crawlers like Bing shouldn't be executing JavaScript at all, but front-end developers can't go without their client-side rendering frameworks so search engines are more or less forced to.

As for a security mechanism, you want to emulate a browser as closely as possible to detect tricks like redirects from safe domains to attack domains and obfuscated URL crap. I'd expect any automated, non-interactive code to execute in a security analysis sandbox.

Is this breaking the standard? Who knows. What is a cloud antivirus but a web user agent running in a data center? The email protocol doesn't specify how the client should deal with links, the robots.txt only works for spiders, not for manually submitted URLs like those clicked in emails, and without a noindex tag you're going to see your page indexed by the mail provider company regardless of what your robots file says.

I think in theory your solution solves the spec breaking problem, but it doesn't solve the problem in practice because there are many other components for which there are no standards and defensive programming is required.


From the spec: although it doesn't explicitly say "don't automatically send a POST upon opening a GET", I think it's fairly clear that doing so is against the spirit of what a GET should represent to the user, unless that POST is itself safe.

"In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.

Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them."


Even if Microsoft claim this is about security scanning, isn't it fairly trivial to configure your webserver to serve up different content depending on the User-Agent request header?

BingBot scans the link, gets a dummy page with 'clean' content, Microsoft delivers the email message to the user, user clicks through the link with actual browser, gets phishing / malware content...
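The check really is that trivial; a sketch of what such a server might do (illustrative only, and also why scanners sometimes spoof browser user agents):

    import express from "express";

    const app = express();

    app.get("/promo", (req, res) => {
      const ua = (req.get("User-Agent") ?? "").toLowerCase();
      const isScanner = ua.includes("bingbot");
      // dummy page for the scanner, the real content for everyone else
      res.send(isScanner ? "<p>This webpage is safe.</p>" : "<p>(actual page content)</p>");
    });

    app.listen(3000);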


Yes, that is exactly what a motivated attacker would do to avoid their phishing site getting flagged as malicious. Here is a good article about how that is accomplished: https://rhinosecuritylabs.com/social-engineering/bypassing-e...


Sure, or even just ignore user agents if you know your target has this scanning in place, just send the malware to the 2nd click.

It's not just MS. Lots of enterprise email security stuff works like this.


Isn't the quick fix that you arrive at some page, and there you have to press a button or load some JS to do the action? AFAIK most email providers (like Gmail) will also visit links. Therefore you shouldn't perform actions directly on the GET request. For instance if you have an unsubscribe link and all you have to do is visit that address, most of your subscribers will be accidentally unsubscribed.

Same if you paste a link in Slack/FB/Discord/Twitter whatever, they will visit the page to create a preview. GET requests shouldn't have side effects.


> AFAIK most email providers (like Gmail) will also visit links.

I keep hearing this, but our newsletter system has been using GET unsubscribe links since at least 2007 (but probably longer), and we never found a wave of Gmail users unsubscribing, we still have a lot of them. I wonder if this is simply an urban legend, if Gmail tries to recognize unsubscribe links, or if there is something else going on.


Your newsletter system probably ignores bots or some IPs


We do not :) It was never an issue there. Bots get ignored for stats, Google IPs get blocked for Ads (Google seems to think every ad link has to be visited by a ton of bots, our customers actually started complaining about the traffic)


Yup this is true - I was just being lazy. But what surprised me was that Bing actually indexed them. (even though my robots.txt said not to)


Can you prove that by linking to a Bing search where one of your pages shows up?



The URL in the search is this: https://shoprocket.io/email-confirmation/34b35b1...

I don't see that in the robots.txt https://shoprocket.io/robots.txt

User-agent: *

Disallow: /cdn-cgi/l/email-protection

Disallow: /login

Disallow: /register

Disallow: /404

Am I missing something?


You may be seeing a stale version, try this: https://shoprocket.io/robots.txt?bypass=1

(I made a lot of changes today when testing all, including "visit as Bingbot" from their webmaster tools with and without the URL blocked by robots.txt)


> was that Bing actually indexed them. (even though my robots.txt said not to)

Never mind indexing them (ie publishing them at Bing.com), if URLs are disallowed in robots.txt then Bing shouldn't even be retrieving them, even if only to scan the content for malware!


This is a common misconception about robots.txt. It tells bots what they should do while directly crawling your site. But if a search engine gets to a URL some other way—for example if it follows a link from somewhere outside your site—it will still index that page.

Robots.txt is not a reliable way to exclude pages from search engine indexes. That is not what it is for. It is for controlling crawler behavior.

The only reliable way to exclude a URL from a search engine index is to serve “noindex” on that URL, either with a metatag or an HTTP header, or both.
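Concretely, that means serving either of these on the URL you want kept out of the index:

    <meta name="robots" content="noindex">

or, as an HTTP response header:

    X-Robots-Tag: noindex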


> It tells bots what they should do while directly crawling your site. But if a search engine gets to a URL some other way—for example if it follows a link from somewhere outside your site—it will still index that page.

I must confess I've been sceptical of robots.txt for a very long time (if I want to stop bots I serve them HTTP 403 Forbidden using .htaccess or similar).

Be that as it may, it appears I'm also confused about what robots.txt does and doesn't do.

Assuming you're correct: let's say I run EvilBot which scrapes sites and want to scrape your site example.com, but your robots.txt only allows Googlebot and disallows everyone else. Am I really OK to:

1. scrape the SERPs from google.com which mention your site ("site:example.com"), then

2. using that list of URIs, use my EvilBot to scrape your site, without needing to touch or respect your robots.txt, since I got the list of URIs on your site from Google, not by scraping example.com directly?


Your step 1 is enough for URLs to be indexed. Even a well-behaved search engine does not need to visit your site to index a URL, including whatever anchor text pointed at it.

If the crawler does then visit your site, it will see your robots.txt and (if well-behaved) obey it and not crawl the contents of the page at that URL. But this does not mean it will remove the URL itself from its index.

Again: robots.txt is intended to control crawler behavior, not search index visibility.

Google's page is a pretty good overview of this distinction:

https://developers.google.com/search/docs/advanced/robots/in...


> Again: robots.txt is intended to control crawler behavior, not search index visibility.

I'm obviously not asking the question clearly, I'm wanting to stop bots from crawling (it's scraping that annoys me), not search engines from listing URIs.

If I want to completely stop a bot from crawling my site (in the sense of "retrieving my content"), won't robots.txt prevent that? Even in the case of the bot having obtained a valid list of my URIs but not the pages contents from a 3rd party source?

Let's say I email you a list of URIs on my site. My robots.txt forbids all crawlers. Are you allowed to give the list of URIs to your bot and retrieve the content?


You are correct: a bot that is well-behaved (follows robots.txt directions) will not crawl your site if your robots.txt forbids crawling.


This is very useful information. You’d really hope that private emails would be excluded by default…


Yes, but for login tokens the Bing bot would be able to log in nonetheless. The login URL would probably end up in some logs at Microsoft or an antivirus vendor. It sounds paranoid, but basically the URL is a cleartext password lying around.


Yeh I think you're right, an extra JS step on the page that the email link leads to would help a lot.


Sounds like anyone dealing with any sort of vaguely sensitive information through email, and certainly any corporation, should avoid using Outlook for anything.

The article is about email verification links, which is a pretty clear case where this can be dangerous, but tons of other links can get emailed without being intended for a wider audience.

Besides, the fact that Outlook shares anything related to the content of your email with the outside world is just completely unacceptable.

(Should private links be sent over unencrypted email? Probably not. But lots of stuff gets emailed that's not super secret and yet also not meant to be shared outside the company.)


Or maybe you shouldn't rely on security through obscurity and instead should add a robots.txt as has been in the web standard since 1997.


I do have a robots.txt to block this directory. But Bing only listens to that for what to crawl, not what to index.


Exactly that.


There's a difference between adding the URLs to search engine results and accessing the URLs to scan for malware. The latter is quite common, lots of email hosts do that. It's not clear to me from the post if the former is actually happening - the author doesn't state that they found the links in Bing's results, just that they were accessed by BingBot.


I found them in Bing results


Are the URLs being served with a “noindex” header? Blocking crawls with robots.txt cannot de-list items from Google or other search engines.

> Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results. If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.

From https://developers.google.com/search/docs/advanced/robots/in...


> Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results.

I realize that you are just the messenger and not the progenitor of that policy, so not addressing this to you, but: that is ridiculous. robots.txt is basically useless.


Robots.txt is a mechanism for providing instructions to automated crawlers, but I don't think it has ever been promised to apply when a URL is manually or automatically submitted through other means (i.e. another site linking to yours). In those cases, a single page will probably be crawled, but the rest of the domain probably won't.


They are now - I didn't think I had to as all the pages are naturally behind a login, it never crossed my mind that Bing would follow email links, let alone index them in search results.


That should be clearer in the article. It's not evident from the SO question or from what you've actually written. A screenshot of those search results would go a long way towards making the article sound more credible.


Yeh I was a bit reluctant to post that as it doesn't look great for my app! But here's the results: https://www.bing.com/search?q=https%3A%2F%2Fshoprocket.io%2F...


Could it be that they were taken from the MS Edge history? I mean still amazingly bad but just throwing it out there. Could explain the gmail ones as well


Quite possibly…


And google harvests all your online purchase emails to log everything you've bought.

https://www.techspot.com/news/80134-google-uses-receipts-sen...

Just a reminder: in the USA, any email left on any online service for over six months is allowed to be read by any law enforcement agency without a warrant.

You'd think these services would have a six-month auto-delete feature but nope.

There's a good reason why there was an email server in the basement; everyone should have their own email server, where at least a physical warrant is needed.


To all the folks that suggest preventing opening of single-use links by robots.txt or user-agent detection etc: Just don't. There are dozens of tools at use throughout the various stages of an email with URLs being delivered that will go out and fetch websites. You have to design any confirmation dialog so the user still has to click a button to confirm, otherwise any one of these tools might inadvertently trigger your confirmation.


>As of Feb 2017 Outlook (https://outlook.live.com/) scans emails

Makes me curious if only the free, online, Outlook does this. There's also paid O365 online Outlook and the fat client Outlook.


Office 365 just seems to make links useless for security now.

Our 365 instance now turns every link into this massive monolith of Safe Links checking URLs through Microsoft, making it impossible to tell whether literally any email is a phishing attempt or otherwise without resorting to pasting the link into one of many online 'decoders'...


Oh, probably this thing: https://docs.microsoft.com/en-us/microsoft-365/security/offi...

Though that is optional and configurable.


At work they enabled safelinks whilst all the mandatory training stated best practice was to check the links before clicking.

It's a shame those links can't have an alt text to show the real link.


You don't want to train users to disambiguate phishing with alt texts which can be spoofed in other contexts.


It's not just the free client. My university uses O365 (?) and the links are in emails checked on other clients (Mail.app of ios/macos).

Admin has also turned on "You don't often get email for __" warnings that edit the email so that gets included in replies. Very useful when you get a large new cohort of student email correspondents each semester :(


Exactly what I was thinking - what really worries me is it also seems to have happened to a lot of @gmail users too. I still can't figure out how Bing managed to find email tokens sent to gmail. Maybe users who connected their gmail account to Outlook...?


This would kind of break single-use links, no? That seems like a real nuisance.


The HTTP GET method is idempotent: it should behave the same way on multiple accesses.

A single use link, e.g. for resetting a password or confirming a subscription, will usually show a webpage with a form that does a POST. Once that POST has been performed, the single use link is used up.

Single use links will mostly have a one-time secret that should not be leaked. Mails that contain such links or any sensitive information should be encrypted.
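A minimal sketch of that "form that does a POST" shape (the route names and in-memory token store are illustrative, not from any particular product): the GET only renders the confirmation form, and the single-use token is consumed by the POST.

    import express from "express";

    const app = express();
    app.use(express.urlencoded({ extended: false }));

    const usedTokens = new Set<string>(); // stand-in for real persistent storage

    // Safe/idempotent: just show the form, never touch the token's state here.
    // (The token should be validated/escaped before echoing it in real code.)
    app.get("/reset/:token", (req, res) => {
      res.type("html").send(`
        <form method="POST" action="/reset/${req.params.token}">
          <input type="password" name="newPassword" />
          <button type="submit">Reset password</button>
        </form>`);
    });

    // The one-time token is only consumed here.
    app.post("/reset/:token", (req, res) => {
      const { token } = req.params;
      if (usedTokens.has(token)) return res.status(410).send("Link already used.");
      usedTokens.add(token);
      // ...actually update the password using req.body.newPassword...
      res.send("Password updated.");
    });

    app.listen(3000);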


How do you send mail to an Outlook user and encrypt it so Microsoft can't snoop on it?


If your goal is to prevent third party software like spam filters and malware engines from triggering actions, you must require a second step that will send a POST/PUT/anything-that-isn't-idempotent request. You can copy the authentication code into a form field and do the entire thing without Javascript if you want to, but a second step is necessary.

If your goal is to hide your secrets from Microsoft, then send the email encrypted or don't send it to Microsoft's servers at all. This is practically impossible, or at least impractical in most cases. You can't control the hosting provider and software of your customers.


You ask for the password again on the visited page before presenting the form responsible for the one-time POST call.


Exactly. From what I've heard today it sounds like most apps have an extra step between the email link and the login, usually a JS step, to check for bots.

Not adding that was my downfall I think.


There's also robots.txt


I struggle to understand how private companies like mine are OK with MS reading all employee email and processing it through their AI. I get these daily creepy emails from MS saying that you said you would do this yesterday.. I have resorted to using burnernote.com, not to hide anything from my company but to hide it from MS who competes with us on some products. I guess burnernote.com will also not work anymore since it creates one-time links.

We are monitoring you for your protection.


In my experience Gmail does this from time to time too, so any “load once” links won’t reliably work.


I mean, Twitter and Slack for example do it too; any service that generates a preview of your links will crawl the URL you provide whether it's secret (e.g. sent in a private message) or not. Very very very few will stop at the "og:image" tags and such, because why would they discard data about you?


I have observed that twitter's bot hits links within seconds of being tweeted. The traffic comes from several locations, not all twitter ASNs. One interesting source is Apple. Their bot/scanner hits soon after.


Anyone paying for the firehose access can do this.


That's true of course. What's interesting to me is that they've decided to pay for this access and visit the links so quickly. It must be pretty expensive or hard to get if only around two-dozen companies pay for access to the data[0].

[0] https://www.washingtonpost.com/technology/2022/06/08/elon-mu...


It might be that Apple is paying for the firehose as a data-source to bootstrap its search engine. Don't they have one accessible via Siri already ? (I don't follow Apple tech very closely).


Just wondering how this might be an attack vector for fucking with Bing... I can think of a couple of avenues; the most elegant would be if the URL itself triggered something within the scanner/URL processor; next up would be the content of the target page attacking the Bing infrastructure. I'd guess the backend processing is sandboxed, but it seems like an interesting avenue that a malicious actor might explore.

Don't try this at home, kids. :)


Sure, we'll do this right after cracking Google through Googlebot.


I remember sending one-time use URLs in emails to customers and they would've expired by the time they clicked them because Outlook was opening them before they did.

Yeah yeah GET is idempotent and I shouldn't do that blah blah. That's not the point.


Can this be exploited to confirm some action automatically?

Or to prevent some site from being indexed by flooding it with invalid links?


This becomes the interesting piece. What can you do with this known side effect..


Absolutely


Everything spies on you unless proven otherwise. Seems to be a rule these days.


In the B2B SaaS where I work we started using single use codes to log in for certain account types (non-admins). No password. We send you an email or an SMS with a 6-digit number. Copy/paste it to log in. Very much like 2FA except there is no password. The session lasts 30 days. The user can disconnect of course.

Curious what HN readers think. Is this secure? Sufficient?


For what it's worth, here's Google admitting that GoogleBot causes POSTs:

https://developers.google.com/search/blog/2011/11/get-post-a...

Automatically triggered POST is not sufficient to keep the bots at bay.

They seem to be implying that only automatically triggered POST are acceptable, but that was also >10 years ago.

With the way things are going, it might be that any on-page confirmation buttons won't be sufficient to keep the bots at bay. Maybe it's time to fight back, check the user-agent, and serve the bots a CAPTCHA?


If you're concerned about privacy, you shouldn't be using any Microsoft products, period.


True, but for the last 30ish years nobody has cared about that opinion because it is profitable to use the lowest common denominator, get shit done and call it a day.


Okay, I thought M$ was just a little bit better than $G. It turned out to be just as bad...


Why would you think that? If M$ had the same position as google even more things would be closed source and more connected with law enforcement and less private.


I was naive to think so, as M$ did not make as many obviously malicious moves as $G recently. But I forgot all those companies are there for money and for sure they will do whatever they can.


Don’t Windows users already just accept whatever from MS? This is what you get with a proprietary operating system. Whatever you’re served.

Now stop complaining and look at the new ads in the Start Menu.


I've noticed Office 365 Safe Links makes an OPTIONS request, not GET. So, restricting the endpoint to GET, via [HttpGet] decorator for example, may be a quick resolution.


While I think a Turing check can easily solve the problem without much friction, this only increases my hatred for Outlook scanning. The worst part: to turn it off, you also have to turn off junk mail protection (well, it used to be that way, it's been a while since I tried).

Now, having my private links indexed by Bing is a bit too much!? I sincerely hope OP is mistaken and Bingbot is actually the outlook scanner.


Unfortunately not - the links were indexed and shown in Bing search results


Can’t you block it with robots.txt or some similar method?


It was blocked by robots.txt but Bing chose to ignore it. I even tried "blocking" the URLs in Bing webmaster tools today and this was the response:

"Block request denied We found that the URL submitted for block is important for Bing users and hence cannot be blocked through Bing Webmaster Tools.

We recommend that the best way to block URLs in this scenario is to add NOINDEX meta-tag to the HTML header of the page."


That is baffling logic. Sure, they think they know best and want to ignore the wishes of the owner of the web site. Why then respect a NOINDEX meta-tag instead of robots.txt?


Exactly - seems the safest way is to explicitly block known bots by user agent from even reaching pages you don't want indexed.


No but one could block it with basic auth. That is how I keep Discord and Valve crawlers off my links.


Our company occasionally does "test phishes" to see how well people resist them. Every time, some of our most security minded engineers end up on the "clicked on the malicious link" lists, when all they did was forward the message to IT to report the phish. I'm wondering if the bingbot leak is the reason.


How long does it take for them to check the link?

I sent an email with a unique link in it to my @Outlook.com account 6hrs ago, and there have been no visits to the link. The email is in my inbox (though I have not opened it).

Does this only happen on opening the email (in the Outlook web ui)?


Isn't this just the "link preview" feature, that is enabled by default in outlook?

Many email clients generate link previews so that they can display a thumbnail of the webpage. It would seem necessary to filter out those requests on the validation link.


Essentially yes - but "previewing" link and sending that link to Bingbot to crawl and index is another matter.

Imagine you send someone a "private" link to a file...Bing sees that and indexes it for the world to see. Not cool.


Is this actually the Bingbot crawling for the index, or do they just use the same bingbot code to generate link previews?

If I have a well-tested crawler sitting here, I don't see why I'd write a brand new one just to fetch previews in outlook...


“Private links” should be covered by robots.txt. The only case I see this happening is for those “anyone with link” shares and those are easy to cover.


AFAIK, most of those "private" links are actually just "unlisted" but they're still public. I'm sure Bing is indexing those too...


Anything private should ideally be put behind authentication. If that isn't possible, then robots.txt. Search engines are _meant_ to index everything that is publicly accessible and not blacklisted in robots.txt.


This reminds me of that time Bing was caught stealing search results from Google queries.


Does Bing observe robots.txt? If it does, that can put your token URL out of harms way.


That's why I shield my URL with a casual password prompt, because external email scanners have no business looking into the enclosed email URL.

Email address domain and enclosed URL domain are the same parent domain.


More likely scanning for vulnerabilities or generating previews. Magic links don't work anymore. Therefore most services send a code or something that you have to enter on a generic page.


Not in the least surprised. By accepting the EULA and using the service free of monetary charge, you are instead paying with the contents of your e-mails and address book.


Another reason to run your own email server


No. Another reason to pay professionals to run your email server for you.


How would you prevent recipients from using Outlook?


We can control one side of the equation, so that prevents them from reading some at least.


If it’s Outlook doing the dirty work then running your own mail server won’t get rid of this problem. Outlook is the client reading the links regardless of mail server.

Also, somewhat a different topic, people that run their own mail servers might also still use Outlook.


Exactly. I also noticed Bing had accessed some non microsoft tokens too. Even gmail accounts were affected. I assume some people have connected their gmail account to the outlook client?


proofpoint does this too. Wonder if they add a separate header or query string param to distinguish from an actual user.


Just wait till someone figures out how this "leaks" personal info. They'll very quickly remove it.


tldr: In 2022, whether you are a paying customer or a free customer, YOU are the product and you will be squeezed for all you got (not specific to Microsoft at all)


I don't think that's very surprising for most people, the real takeaway is that not only will Bing read your emails, but they may also index any links you send and serve them in search results.


Actually, you are right. I stand corrected, this really is a new low


Wow, you really think that all of the Fortune 500 companies using GSuite/Office365 are being squeezed for everything they have got?


It might lead to a sensitive data leak, as cloud storage links can also be crawled by Bing


> It might lead to a sensitive data leak, as cloud storage links can also be crawled by Bing

No, what might lead to a sensitive data leak is fools who store sensitive data behind unprotected links


Agree, in my experience storage buckets are always private by default, and you must take several specific steps to make them public, ignoring the very big warnings sprinkled in each confirmation page along the way.

Are there any cloud vendors that don't follow this approach?


Almost all of them. A good example is a Dropbox link you send to someone. I could generate this link to a private file in my Dropbox, email it to you, and Bing (may) index it.

https://www.dropbox.com/s/vucien2ns8jktga/denim%20bodywarmer...

I doubt many people realise this when they email "private" links...


Google reads your emails.

Whenever I buy a flight, google puts the date on "my" calendar.

Just let's not pretend Microsoft is especially bad at this, ok?


But that is your calendar, not something that is normally speaking visible to the whole web.


It's not "your" calendar, it's Google's calendar.


You know perfectly well what I meant.


By that logic they are not your emails, they are Google's.


Which would be correct, considering that ownership implies full and complete right of dominion over said entity, which you simply don't have. You could be locked out of your account with no means of getting back access, you can delete your data but have no guarantees that the data has been deleted, Google may create 'derivative works' on your data (see the terms of use) without your permission or will provide data about your account to authorities, etc.. That is not ownership, that's renting.



