I've had to deal with this with e-mail verification links and Auth0.
The user clicked the link after getting it in their mailbox but then Auth0 throws up an error page because the e-mail address has already been verified (by Outlook scanning).
The problem becomes worse if for some reason the mail ends up in the junk mail folder so the user thinks they've never received the mail but when you check it looks like the e-mail address was verified successfully.
That has caused a lot of annoying back and forth trying to figure out what the hell is going on. We ended up adding a custom page to handle e-mail validation so we could handle the situation where the user lands on the page and the address has already been verified. Super annoying.
Links like this are stupid regardless of Outlook's behaviour because they require a perfectly reliable client and network and user in a perfectly undisturbed flow. If I can't F5, if I double-click, if my mouse is wonky, my wifi is bad, my power goes out, my computer hangs, my DSL dies just after a click, if I accidentally close the tab.. there are any of a thousand reasons why abusing GET for a one-time-use page or redirect is horribly wrong.
It takes incredible arrogance to continue using them in order to "improve usability" given all the obvious and common cases where they completely destroy usability. The difficulty for a provider to verify they aren't sending you to a phishing or browser 0day page barely scratches the surface.
My father has insisted on doing this for over 20 years, but he doesn't know how to do it himself. I expect a password-reset phone call from him every 2 or 3 days, and have since 1998. Just recently someone from his bank's IT department called him directly about having reset his password over 500 times.
I'm not sure if he's still doing it but someone put together https://theuserisdrunk.com/ and https://theuserismymom.com/ a few years back... I wonder if you could do something similar here, given the level of absolute predictability that seems to be involved.
I sadly can't put my finger on what's so compelling about this, just that my "oh that person should talk to a UX team lead!" meter just went plink
Or "passwordless" login, and I love it. Not many people use password managers, and most will reuse passwords between websites (i.e., their bank and some random unsecured SaaS product). One-time emailed passwords are an easy way to avoid this problem and have a fairly secure site (mind you, it's only as secure as their email). You can layer 2FA on top of this too.
It's only annoying if the site is constantly timing you out so that every single visit you need to resend. Why not just use secure cookies to remember the user for say a week?
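A minimal sketch of that suggestion: issue a long-lived session cookie after the first magic-link login, so re-sending is only needed once a week. The cookie attributes are standard; the `session` name and the token handling are illustrative, and the token is assumed to be a random, server-stored session id.

```python
from datetime import timedelta

def remember_me_cookie(session_token: str, days: int = 7) -> str:
    """Build a Set-Cookie header value that keeps the user signed in
    for `days` days. Secure + HttpOnly keep the token off plain HTTP
    and out of reach of page scripts; SameSite=Lax limits CSRF."""
    max_age = int(timedelta(days=days).total_seconds())
    return (
        f"session={session_token}; Max-Age={max_age}; "
        "Secure; HttpOnly; SameSite=Lax; Path=/"
    )
```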
I had a similar case recently where I was getting the magic link in an email on my phone, and needed to copy it into Slack so I could click it on the laptop I wanted to actually log in on
This... was impossible to do, because by long-pressing on iOS to get the Copy prompt, iOS also goes ahead and opens a preview of the link next to it
Haha, I was in a restaurant and you paid through your phone. My browser updated so it closed between the thank you page and the payment click. State was lost so the thank you page was broken. The restaurant didn’t think I paid but my bank account said otherwise (this was a bank transfer via ideal, not credit card). Getting out of there without paying twice was entertaining.
>We ended up adding a custom page to handle e-mail validation so we could handle the situation where the user lands on the page and the address has already been verified.
That's a yikes from me! So I can sign up on your service as anyone with an Outlook account, without verification?
I'd assume the custom page has a random URL and requires entering the email address (requiring a match) or clicking a button. I've seen some account confirmation pages like that.
Discord let someone sign up with my gmail email address, sent an email verification link, and before I saw either the "welcome to Discord" or "please verify your email" links they'd already let the person in as me. I don't know if this is because of google crawling links from mail or some other kind of failure, but I wasn't pleased that Discord would let someone impersonate me.
Wow! I think you just figured out an issue I had while working at a previous company using Auth0, where the authentication token would expire before the user had actually gone there (so the user saw an error page when clicking), but on our side it looked like the user went there and dropped off immediately after. Maybe 1% of users complained about this, but we never found the root cause (we moved to our own authentication before we could figure it out). This has to have been why. Thanks for sharing this!
HTTP GET requests are supposed to be idempotent, meaning that calling a URL twice should not lead to any different result than calling it once. This is part of the HTTP standard.
So while I think what Outlook does here is wrong, what these webpages do is simply a bug that should be fixed and shows a lack of understanding of HTTP.
I think that's a bit too much. Nothing in that suggests they are breaking anything in the HTTP specification. You're right that GET requests have to be idempotent, but the exchange of the single-use code from the email for an API token is most likely behind a non-GET request (like POST). The HTTP server responds to GET requests with the static assets (HTML/CSS/JS), and the static assets include JavaScript that calls the POST endpoint for the exchange.
At least that's my guess. I agree it's a bug on their side, and they should fix it. But I think it's more of a UX issue than breaking the protocol.
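A hedged sketch of that split (using a plain HTML form submit rather than JS, but the idea is the same): the GET stays safe and a separate POST consumes the code. The endpoint names and the code value are made up for illustration.

```python
# In-memory stand-ins for a real token store.
used_codes: set[str] = set()
issued_codes: set[str] = {"abc123"}  # illustrative one-time code

def handle_get_verify(code: str) -> str:
    """Safe GET: returns HTML only, changes no state. A mail scanner
    fetching this URL burns nothing."""
    return ('<form method="POST" action="/verify">'
            f'<input type="hidden" name="code" value="{code}">'
            '<button>Confirm my address</button></form>')

def handle_post_verify(code: str) -> bool:
    """Non-idempotent work lives behind POST: consume the code once."""
    if code in issued_codes and code not in used_codes:
        used_codes.add(code)
        return True
    return False
```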
GET is supposed to be safe as well as idempotent though.
Request methods are considered "safe" if their defined semantics are essentially read-only; i.e., the client does not request, and does not expect, any state change on the origin server as a result of applying a safe method to a target resource. (RFC 7231, § 4.2.1)
I wouldn't go around saying others don't understand HTTP. Puts you in a very awkward position when you are wrong. Which, you are. Idempotency and safety are separate concepts in HTTP. But, you do you. Take the advice however you want.
IMHO it is clearly a wrong assumption on the side of any such sender. A verification link should have clear definite actions for the user receiving it:
- It's me, let me confirm my address
- I never signed up for this heap of diamonds
Whenever I (not even a bot) click or follow a link from my mailbox, by accident or on purpose, I don't expect that to validate an account for anyone else, but me, intentionally, using a password I know.
I had a customer who had some sort of software that followed the link in the email we sent (no big deal so far), and THEN would follow every link and button on that page.
We had a handy quick decline and accept button on there so they were auto declining things…
I didn’t hate email until I got into web development….
Yup. It was a pain. I have no idea who wrote that and thought it was a good idea…
It was a super basic web form too. Probably the most html markup standard thing we have. Nothing strange about it that could have triggered some sort of strange behavior.
Microsoft does this because they're security-scanning every link in every Outlook email for known phishing and malware attacks. If Bing has not seen the web page before and it's not in Bing's dangerous-web-page index, it first needs to scan/index the page to determine whether it's a phishing/malware page, before returning that verdict to Outlook so the email can be flagged as dangerous.
> Microsoft does this because they're security scanning / checking all links in every Outlook email for known phishing and malware attacks
The problem with that is that the logic is broken. Microsoft cannot possibly know all phishing sites, especially smaller ones. By obfuscating the link, the user can no longer verify it themselves without clicking, but Microsoft will say it's safe. So the user is left with a false sense of security and is worse off.
It only works for huge sites ( e.g. mytwitter.lol phishing for twitter and similar), but drastically lowers the chance of less high profile phishing being caught.
The problem with that is that the logic is broken. If 99.99% of phishing can be prevented this way, what problem do you have with it? Would you really catch that 0.01% that an automated system wouldn't?
You mean you don't verify calls to action via other information channels? Fairly regularly I get phishing emails that correctly spoof the crypto headers of major sites (e.g., because of a misconfigured mail service). If an email asks me to do something, it either doesn't get done or I cover my ass in as many ways as possible, no exceptions.
That isn't by itself an argument against a good automated system -- I definitely like not having to sift through most of that garbage, but catching the 0.01% should be a routine practice, not something that seems like an insurmountable burden.
Maybe you should implement a "feature" that serves a simple static HTML page <p>This webpage is safe.</p> to "bingbot" and serve the real page to everyone else.
I believe the recommended practice is to hover over the URL before clicking the link.
If you do so in Outlook, there will be a popup that shows "Original URL: XXX". This allows users to determine for themselves whether the link is safe.
We got some security courses about that too. Unfortunately, Outlook replaces all of them with some Safelink URL rewriting, so the only way left to find out if a link is scammy is clicking it.
It is in fact possible to extract a destination URL from a Safelink one without clicking it. For the full link this can be tedious, but identifying the domain can still be done quickly.
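For illustration, a small sketch of that extraction. It assumes the common Safelinks format where the destination sits URL-encoded in the `url` query parameter; if Microsoft changes the format, this breaks.

```python
from urllib.parse import urlparse, parse_qs

def safelink_destination(safelink: str) -> str:
    """Pull the original destination out of an Outlook Safelinks URL
    without visiting it. parse_qs already percent-decodes the value."""
    qs = parse_qs(urlparse(safelink).query)
    return qs.get("url", [""])[0]
```

From there, `urlparse(...).hostname` on the result gives you the domain to eyeball.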
For normal URLs, I agree. But in this case you have adversarial URLs. Suppose the scammer puts "http" and "www.google.com" in the URL parameters, while the real host is eight random characters dot some obscure TLD.
I don't trust myself enough to be 100% sure I can decode a URL-encoded misleading mess perfectly every time.
Scammers have already hidden URLs in the username part of a URL, like www.google.com.unholymessherethatscrollsoutoftheurlbar @ malignantdomainnotgoogle.blah
Microsoft offers this as a security product. It's impossible to know all links, but known ones can be blocked to limit future issues. Other enterprise email security products scan the links and follow all the redirects as well. Within an incredibly small amount of time after delivery, every link in an email is "clicked" by those products.
Cool, so we should just stop building and running almost anything in existence because it's not all-encompassing? That sounds like a suboptimal path forward.
But there is nothing to indicate either in the post or in the referenced SO thread that the URLs are published to the search results. They are visited by bingbot, that much seems confirmed, but there’s no example where one of these results shows up in the public search results.
I've put in a hard block for all crawlers on all pages. Works for my scenario I think. Hopefully they don't lie in their user agent. Then it's going to be really bad.
Yes, but there is no indication they are publishing it in the search results.
The original post is just complaining that the malware scanning is visiting the links.
They come to the following conclusion
>This effectively makes all one-time use links like login/pass-reset/etc useless.
Which we all know is not true because sites like onetimesecret.com allow for entering a separate password to prevent this sort of thing when it does happen.
It would be an interesting discussion to talk about what Microsoft's whitelisting process looks like, but the original article doesn't seem to understand what is going on well enough to drive the conversation in that direction.
All pages with one-click links should be marked noindex, follow or noindex, nofollow. Your SEO consultant (if you have one) should have advised you on this.
I am not saying this excuses the privacy violation but just suggesting there are things we can do...
Worse would be links that are private to the people who possess the URL, like a private video on YouTube or a private document in Google Docs. The security depends on the URL being secret. This would silently publish secret information.
If those pages have no proper meta tags or robots.txt, there's absolutely nothing wrong with this. Security by obscurity was never a good approach; from proxies to security scanners, there has always been software that crawls unassuming URLs and publishes the results somewhere, if only in a report to the admin.
If you can say for certain that the links being published are coming from the malware scanning, and not being taken from users' browser sessions that are using Microsoft Edge you should elaborate on this.
I would be pretty mortified if browsers were using user browser sessions to scan content and pass it to bingbot…? What about if you’re browsing something local? Or your bank account?
It is common for corporate email security appliances as well. URLs should not be used for authentication, and neither should email. I really want to pick the brains of people who work on these types of systems to see why they think otherwise.
Many people (most?) prefer to signup to services by email address. To do so, those email addresses must be verified. How would you verify it without sending them an email link?
You can verify the validity of an email like that, no issue there. Just don't use it as an authentication factor. Control over an email account should not trump passwords (what you know) or proper 2FA (what you have, typically; email can be 2FA like SMS, and like SMS it is not a good choice). If a person proves they control an email account, then ask them for additional info like secret questions or other information configured during registration.
I should not be able to take over your life because I compromised your phone which has sms, TOTP app and email.
It is not. You can initiate a password reset via email, but additional recovery controls like security questions should still be required. In an ideal world you have 2FA as well; if you reset that via email too, then it isn't actually 2FA, it's email-based 1FA with extra steps. If your 2FA has a separate recovery mechanism as well, that would be ideal. If it was my webapp, I would use hashes of 3 answers to user-chosen questions, hashed in the browser/client. It could be an object pairing as well: every user gets a list of 30 objects or so and they pick 3 pairs as a recovery combination.
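A sketch of hashing a recovery answer client-side, as that comment proposes. The normalization step (so "Fluffy " and "fluffy" match) and the per-user salt are my assumptions, not part of any particular product.

```python
import hashlib
import unicodedata

def recovery_hash(answer: str, user_salt: str) -> str:
    """Digest a security-question answer before it leaves the client,
    so the server stores and compares only hashes. NFKC-normalize,
    strip, and lowercase so trivial typing differences still match."""
    normalized = unicodedata.normalize("NFKC", answer).strip().lower()
    return hashlib.sha256((user_salt + normalized).encode()).hexdigest()
```

A real deployment would want a slow KDF (scrypt/argon2) rather than bare SHA-256, since answers are low-entropy.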
If this were security scanning, why does it identify itself as BingBot? Doesn't that just allow cloaking and offer an easy workaround for any adversary with a modicum of intelligence?
The HTTP GET method is safe and idempotent by specification. Visiting a webpage should not trigger password resets or any other actions by itself. If that's a problem then it's the site's fault for being defective.
You’re right that it was a bit of an oversight on my behalf, as the links were only generated after a verified human user action (signup) I had assumed the 1 time links to their email would be safe. But regardless of the link action, it shouldn’t be passing that data to Bingbot to crawl and (possibly) index in search engine results. Private email data should not be shared with search engine crawlers IMO.
So how do you implement a "one click unsubscribe" link in an email? They're on GET requests. You could use JavaScript on the resulting page to then trigger the unsubscribe but bots are now running JavaScript as well.
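One real answer to that question is RFC 8058 "one-click" unsubscribe: the sender advertises an unsubscribe URL in a header, and a supporting mail client sends a POST to it, so a scanner's GET unsubscribes nobody. The header names are from the RFC; the URL and token below are illustrative.

```python
def unsubscribe_headers(token: str) -> dict[str, str]:
    """Email headers for RFC 8058 one-click unsubscribe. The mail
    client POSTs the fixed body 'List-Unsubscribe=One-Click' to the
    List-Unsubscribe URL; a plain GET on that URL should do nothing."""
    return {
        "List-Unsubscribe": f"<https://example.com/unsub?token={token}>",
        "List-Unsubscribe-Post": "List-Unsubscribe=One-Click",
    }
```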
… but it doesn’t matter if the client is compromised, because all it would hurt is the user, right? If the client was compromised, it could just not send anything to your servers, or ignore the results, or …
Dunning-Kruger level of understanding: Never trust the client for anything ever, it's unreliable, everything must be off client.
Never mind the client is literally the interface into your system, so it being compromised is already game over for an application where the user is most vulnerable party you wanted to protect...
Deep understanding: Trusting the client requires a well thought out security model.
If the client is hacked in this case, they already have full control over what the user sees, they can cut out your remote check.
Maybe a good balance would be to hash the root of the URLs and compare those, or use fuzzy hashing on page contents, just so that the backend isn't getting a bunch of private urls that might accidentally get logged somewhere.
Trades detecting stuff hidden behind redirects for less liability on your backend, something to possibly consider depending on functional requirements.
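A sketch of the first variant suggested above: hash only the URL's origin before it goes to the scanning backend, so private paths and tokens never reach (or get accidentally logged by) the server. This is an illustration of the trade-off, not any real product's protocol.

```python
import hashlib
from urllib.parse import urlsplit

def origin_digest(url: str) -> str:
    """Reduce a URL to a digest of its scheme+host before transmission.
    Two URLs on the same origin hash identically, so the backend can
    match against a blocklist without ever seeing paths or tokens."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}".lower()
    return hashlib.sha256(origin.encode()).hexdigest()
```

The stated cost applies: anything distinguishable only by path, or hidden behind a redirect, is invisible to the backend.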
In this example you're right. For something like scanning a site for malicious content, on-device is not a bad approach. It decreases the amount of data sent to the server.
The client has a much bigger issue to worry about if the client-side malware scanning has been compromised. Malware could modify the UI/network calls such that your server-side scanning displays a positive result anyway.
You have to trust the client to display information to the user at some point. Link malware scanning that job can safely be delegated to the client. Authentication cannot.
I have observed this, but also found that BingBot modifies the query string parameters of your URL. It does this by changing a character of the URL, possibly in an attempt to find new pages?
I noticed this because I generate links with a signed token to ensure integrity, and started receiving invalid-token crash reports in Sentry, always from BingBot.
To fix this I had to move the tokens from the query string into the URL path itself, so BingBot couldn't change them.
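A hedged sketch of that fix, assuming an HMAC-signed token carried in the path rather than the query string. The path layout, secret handling, and domain are made up for illustration.

```python
import hashlib
import hmac

SECRET = b"server-side secret"  # illustrative; load from config in reality

def make_link(user_id: str) -> str:
    """Embed the signature in the URL path, since query parameters
    were observed being mutated by the bot."""
    sig = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"https://example.com/confirm/{user_id}/{sig}"

def verify_path(user_id: str, sig: str) -> bool:
    """Recompute and constant-time-compare; any mutation fails cleanly
    instead of crashing."""
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Note this only makes tampering detectable; a bot that fetches the unmodified link can still trigger whatever the URL does, which is the thread's larger point.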
I would guess that this is probably done on purpose to avoid tripping one-time-use links. Seems like a good way to hide malware from the scanner though.
I've finally found someone else who's seen this behaviour!
I've noticed this too, and I found (in my case anyway) that Bing/Outlook seems to Rot13 the keys of the query parameters - is this what you're seeing too?
Outlook will only send GET requests, which are supposed to be safe and idempotent unless you're ignoring the spec. A message saying "this code has already been used" after sending a GET request is a bug.
I don't see the problem here; all services need to do is add a page that says "welcome back, $Username, click here to log in!" which sends a POST request to do the actual confirmation, without breaking any specifications.
Microsoft claims the visiting bot is BingBot, but it's probably just the SmartScreen system checking for malicious links/downloads/etc., like many cloud-integrated security products do these days.
I can set my browser to pretend I'm BingBot, you can't derive anything meaningful from the user agent. Unless you find your secret URLs in Bing's search results, your secret links aren't actually being monitored by a search engine.
I’m fairly certain they are. My links ended up indexed in Bing search results. The only place they were ever rendered was in private emails to users. Bing should not be indexing that.
You're right, it shouldn't. It's possible that they're fetching these URLs from their customers' browsing history and submitting those (external submissions follow different crawling rules, sometimes bypassing robots.txt). Bing's webmaster information says so, at least: https://www.bing.com/webmasters/help/webmasters-guidelines-3...
For a bit of added "fun", Google will do the same, but if you add a page to robots.txt and set noindex, they won't process the noindex directive, and external indexing sources might still generate search results: https://developers.google.com/search/docs/advanced/crawling/...
That's a bit of a narrow view on this problem. When sending a link to someone, you expect that someone to view the link. Not some random mail service. Who gave the mail server permission to access the page? What if it contains copyrighted material? What if it's one of the millions of pages which don't follow the HTTP design philosophy to the letter?
> Who gave the mail server permission to access the page?
The recipient of the email, or their employer's IT department that is paying another company for mail services.
If you send me an email with a link then I do believe I have the right to send that link to a third party service that can validate that it's not malicious. If I decide to sign up for a mail service that promises to protect me from phishing emails, then I [0] expect said service to read the emails I receive and examine the links within them. I would be upset if the service used the info I share with them for purposes other than keeping me safe, though.
I readily admit that I have, at various points in my life, signed up for services without reading the entire TOS that I agreed to. I try to choose companies that I feel I can trust to not abuse me too much, and sometimes I avoid certain services because I don't trust the company behind them enough to respect my privacy.
[0] I acknowledge that not everyone is as knowledgeable as me, and many people might not realize that this is how the protection works. So if the argument is more education, I'm in favor.
> When sending a link to someone, you expect that someone to view the link
That sounds like a narrow view of email. This has never ever been true. Corporate firewalls have always opened links, and many users use tracking blockers in their email provider that automatically open incoming email and detect tracking cookies. I cannot stress strongly enough that you cannot rely on only one "user" clicking a link.
When sending a link to someone, you expect your antivirus, your email provider, your email provider's spam filter, any intermediate email providers, the recipient email provider's spam filter, the recipient's email provider, the recipient's antivirus, your recipient's mail client, your recipient, and any other people who the message will be forwarded to, to see the link and evaluate it. The email standard is pretty clear that any number of intermediate servers and services can and will be able to see what you're sending.
Email isn't WhatsApp, there are probably at least three or four parties who will scan the link in any way they like. If you control your side you can make sure there are only one or two parties scanning the email en route, but the number can never be guaranteed to be zero without workarounds.
Who gave the mail server permission to access the page? The person who set up email on the domain. If you don't trust the hostmaster, don't send email to that domain. What if it contains copyrighted materials? Well, you just shared a plaintext link with a whole bunch of people, depending on your local legislation you may be in trouble.
You can't even expect a link clicked once by a single user in a browser to only appear once on the server side. TCP connections get dropped and retried. This isn't some kind of philosophical interpretation of a mystical protocol spec, this happens in real life. If you use POST/PUT/whatever requests, the user agent will prompt the user if they really want to repeat a request; this protection has been built in for years. It's just how browsers work and how they've been working for decades.
If your recipient is behind a proxy, the link may be visited several times each hour for up to a month while the proxy refreshes its cache. This was a more prevalent problem back in the day, these days web proxies are mostly a thing of the past; however, proxies still exist, and if you don't pay attention to those things they will bite you in the ass.
In real life bugs happen. That's fine in these cases, web dev isn't exactly rocket science, bugs are tolerated and can be fixed. The bug here isn't the fact that links get visited twice, though: the bug here is that the developers who set up their magical links forgot about idempotency when they wrote their code, or they chose to ignore the problem because they never ran into it themselves. Either way, the responsibility to get it fixed isn't on anyone but the party violating the spec.
As a workaround, S/MIME or PGP should work around most of these problems as intermediate servers can't see what's going on. What the client's machine will do with the decrypted message is still up to interpretation, of course.
> Outlook will only send GET requests, which are idempotent unless you're ignoring the spec. A message saying "this code has already been used" after sending a GET request is a bug.
Alright, so imagine this: we have two endpoints, GET /page and POST /increment. Making a POST request to /increment increments a counter kept in memory and returns its new value. The GET /page endpoint returns an HTML file containing JavaScript code that, when executed, calls the /increment endpoint.
Are we now breaking the HTTP specification's requirement that GET requests be idempotent if we visit /page in our browser? I think not, but this is sometimes how pages are implemented, and robots are gonna have to deal with it, as otherwise many would consider them broken.
Don't get me wrong, I think it's a shitty implementation as well. But is it breaking the HTTP specification? Unlikely.
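The two endpoints from that comment, sketched as plain functions (no web framework, just the state split; the counter lives in module memory):

```python
counter = 0  # in-memory state, as in the comment above

def get_page() -> str:
    """GET /page: idempotent by itself — it only returns HTML. The JS
    it ships is what fires the POST, and only when a full browser
    actually executes it."""
    return '<script>fetch("/increment", {method: "POST"})</script>'

def post_increment() -> int:
    """POST /increment: the actual state change."""
    global counter
    counter += 1
    return counter
```

A crawler that fetches /page without running JS never touches the counter; one that does run JS (as Bing reportedly does) will, which is exactly the gray area being argued.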
I don't think it's breaking the spec per se, but web crawlers execute javascript and people hit reload on their browsers, sometimes accidentally. Automating this process may not be the solution here.
Personally, I think web crawlers like Bing shouldn't be executing JavaScript at all, but frontend developers can't go without their client-side rendering frameworks, so search engines are more or less forced to.
As for a security mechanism, you want to emulate a browser as closely as possible to detect tricks like redirects from safe domains to attack domains and obfuscated URL crap. I'd expect any automated, non-interactive code to execute in a security analysis sandbox.
Is this breaking the standard? Who knows. What is a cloud antivirus but a web user agent running in a data center? The email protocol doesn't specify how the client should deal with links, the robots.txt only works for spiders, not for manually submitted URLs like those clicked in emails, and without a noindex tag you're going to see your page indexed by the mail provider company regardless of what your robots file says.
I think in theory your solution solves the spec breaking problem, but it doesn't solve the problem in practice because there are many other components for which there are no standards and defensive programming is required.
From the spec: although it doesn't explicitly say "don't automatically send a POST upon opening a GET", I think it's fairly clear that doing so is against the spirit of what a GET should represent to the user, since the POST is not a safe request.
"In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.
Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them."
Even if Microsoft claim this is about security scanning, isn't it fairly trivial to configure your webserver to serve up different content depending on the User-Agent request header?
BingBot scans the link, gets a dummy page with 'clean' content, Microsoft delivers the email message to the user, user clicks through the link with actual browser, gets phishing / malware content...
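The evasion described above is trivially expressible in code, which is the point of the objection. This is a sketch of why User-Agent-based scanning is weak, not something to deploy; the strings are illustrative.

```python
def respond(user_agent: str) -> str:
    """Cloaking: a scanner that announces itself gets a clean dummy
    page, while everyone else gets the real (here hypothetical)
    content. Real scanners mitigate this with unmarked user agents
    and varied source IPs."""
    if "bingbot" in user_agent.lower():
        return "<p>This webpage is safe.</p>"
    return "<p>Actual page content here.</p>"
```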
Isn't the quick fix that you arrive at some page and there have to press a button or load some JS to perform the action? AFAIK most email providers (like Gmail) will also visit links, therefore you shouldn't perform actions directly on the GET request. For instance, if you have an unsubscribe link and all it takes is visiting the address, most of your subscribers will be accidentally unsubscribed.
Same if you paste a link in Slack/FB/Discord/Twitter whatever, they will visit the page to create a preview. GET requests shouldn't have side effects.
> AFAIK most email providers (like Gmail) will also visit links.
I keep hearing this, but our newsletter system has been using GET unsubscribe links since at least 2007 (but probably longer), and we never found a wave of Gmail users unsubscribing, we still have a lot of them. I wonder if this is simply an urban legend, if Gmail tries to recognize unsubscribe links, or if there is something else going on.
We do not :) It was never an issue there. Bots get ignored for stats, Google IPs get blocked for Ads (Google seems to think every ad link has to be visited by a ton of bots, our customers actually started complaining about the traffic)
(I made a lot of changes today when testing all, including "visit as Bingbot" from their webmaster tools with and without the URL blocked by robots.txt)
> was that Bing actually indexed them. (even though my robots.txt said not to)
Never mind indexing them (ie publishing them at Bing.com), if URLs are disallowed in robots.txt then Bing shouldn't even be retrieving them, even if only to scan the content for malware!
This is a common misconception about robots.txt. It tells bots what they should do while directly crawling your site. But if a search engine gets to a URL some other way—for example if it follows a link from somewhere outside your site—it will still index that page.
Robots.txt is not a reliable way to exclude pages from search engine indexes. That is not what it is for. It is for controlling crawler behavior.
The only reliable way to exclude a URL from a search engine index is to serve “noindex” on that URL, either with a metatag or an HTTP header, or both.
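A small sketch of serving both forms of noindex at once, per the advice above. The helper and body are illustrative; the `X-Robots-Tag` header and `<meta name="robots">` tag are the standard mechanisms.

```python
def noindex_response(body: str) -> tuple[dict[str, str], str]:
    """Attach noindex both as an HTTP header and as a meta tag, so
    the URL stays out of indexes even when a crawler reached it via
    an external link (which robots.txt alone does not prevent)."""
    headers = {
        "X-Robots-Tag": "noindex",
        "Content-Type": "text/html; charset=utf-8",
    }
    html = (
        '<html><head><meta name="robots" content="noindex"></head>'
        f"<body>{body}</body></html>"
    )
    return headers, html
```

Important caveat from the rest of the thread: the crawler must be *allowed* to fetch the page to see either signal, so don't also disallow it in robots.txt.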
> It tells bots what they should do while directly crawling your site. But if a search engine gets to a URL some other way—for example if it follows a link from somewhere outside your site—it will still index that page.
I must confess I've been sceptical of robots.txt for a very long time (if I want to stop bots I serve them HTTP 403 Forbidden using .htaccess or similar).
Be that as it may, it appears I'm also confused about what robots.txt does and doesn't do.
Assuming you're correct: let's say I run EvilBot which scrapes sites and want to scrape your site example.com, but your robots.txt only allows Googlebot and disallows everyone else. Am I really OK to:
1. scrape the SERPs from google.com which mention your site ("site:example.com")
then
2. using that list of URIs, use my EvilBot to scrape your site, without needing to touch or respect your robots.txt, since I got the list of URIs on your site from Google, not by scraping example.com directly?
Your step 1 is enough for URLs to be indexed. Even a well-behaved search engine does not need to visit your site to index a URL, including whatever anchor text pointed at it.
If the crawler does then visit your site, it will see your robots.txt and (if well-behaved) obey it and not crawl the contents of the page at that URL. But this does not mean it will remove the URL itself from its index.
Again: robots.txt is intended to control crawler behavior, not search index visibility.
Google's page is a pretty good overview of this distinction:
> Again: robots.txt is intended to control crawler behavior, not search index visibility.
I'm obviously not asking the question clearly, I'm wanting to stop bots from crawling (it's scraping that annoys me), not search engines from listing URIs.
If I want to completely stop a bot from crawling my site (in the sense of "retrieving my content"), won't robots.txt prevent that? Even in the case of the bot having obtained a valid list of my URIs, but not the pages' contents, from a 3rd-party source?
Let's say I email you a list of URIs on my site. My robots.txt forbids all crawlers. Are you allowed to give the list of URIs to your bot and retrieve the content?
Yes, but with login tokens the Bing bot would be able to log in nonetheless. The login URL would probably also end up in some logs at Microsoft or an antivirus vendor. It sounds paranoid, but the URL is basically a cleartext password lying around.
Sounds like anyone dealing with any sort of vaguely sensitive information through email, and certainly any corporation, should avoid using Outlook for anything.
The article is about email verification links, which is a pretty clear case where this can be dangerous, but tons of other links can get emailed without being intended for a wider audience.
Besides, the fact that Outlook shares anything related to the content of your email with the outside world is just completely unacceptable.
(Should private links be sent over unencrypted email? Probably not. But lots of stuff gets emailed that's not super secret and yet also not meant to be shared outside the company.)
There's a difference between adding the URLs to search engine results and accessing the URLs to scan for malware. The latter is quite common, lots of email hosts do that. It's not clear to me from the post if the former is actually happening - the author doesn't state that they found the links in Bing's results, just that they were accessed by BingBot.
Are the URLs being served with a “noindex” header? Blocking crawls with robots.txt cannot de-list items from Google or other search engines.
> Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results. If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.
> Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results.
I realize that you are just the messenger and not the progenitor of that policy, so not addressing this to you, but: that is ridiculous. robots.txt is basically useless.
Robots.txt is a mechanism for providing instructions to automated crawlers but I don't think they've ever been promised to be used when a URL is manually or automatically submitted through other means (i.e. another site linking to yours). In those cases, a single page will probably be crawled, but the rest of the domain probably won't.
They are now - I didn't think I had to, as all the pages are naturally behind a login; it never crossed my mind that Bing would follow email links, let alone index them in search results.
That should be clearer in the article. It's not evident from the SO question or from what you've actually written. A screenshot of those search results would go a long way towards making the article sound more credible.
Could it be that they were taken from MS Edge history? I mean, still amazingly bad, but just throwing it out there. It could explain the Gmail ones as well.
To all the folks that suggest preventing opening of single-use links by robots.txt or user-agent detection etc: Just don't. There are dozens of tools at use throughout the various stages of an email with URLs being delivered that will go out and fetch websites. You have to design any confirmation dialog so the user still has to click a button to confirm, otherwise any one of these tools might inadvertently trigger your confirmation.
Office 365 just seems to make links useless for security now.
Our 365 instance now turns every link into a massive monolith of Safe Links checking URLs through Microsoft, making it impossible to tell whether any email is a phishing attempt without pasting the link into one of many online 'decoders'...
It's not just the free client. My university uses O365 (?) and the links are rewritten in emails even when read in other clients (Mail.app on iOS/macOS).
Admin has also turned on "You don't often get email from __" warnings that edit the email, so the warning gets included in replies. Very useful when you get a new large cohort of student email correspondents each semester :(
Exactly what I was thinking - what really worries me is that it also seems to have happened to a lot of @gmail users. I still can't figure out how Bing managed to find email tokens sent to Gmail. Maybe users who connected their Gmail account to the Outlook client...?
The HTTP GET method is meant to be safe and idempotent: it should cause no side effects and behave the same way on multiple accesses.
A single use link, e.g. for resetting a password or confirming a subscription, will usually show a webpage with a form that does a POST. Once that POST has been performed, the single use link is used up.
Single use links will mostly have a one-time secret that should not be leaked. Mails that contain such links or any sensitive information should be encrypted.
If your goal is to prevent third party software like spam filters and malware engines from triggering actions, you must require a second step that will send a POST/PUT/anything-that-isn't-idempotent request. You can copy the authentication code into a form field and do the entire thing without Javascript if you want to, but a second step is necessary.
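A minimal sketch of that second step in Python (the function names and in-memory token store are made up for illustration; a real app would use a web framework and a database):

```python
import secrets

# Hypothetical in-memory store: token -> consumed flag.
tokens = {}

def issue_token():
    """Create a single-use token, e.g. to embed in an emailed confirmation link."""
    token = secrets.token_urlsafe(32)
    tokens[token] = False
    return token

def handle_get(token):
    """Safe step: the emailed link (and any scanner that follows it) only gets a form."""
    if tokens.get(token) is False:
        return 200, '<form method="POST"><button>Confirm</button></form>'
    return 410, "Link invalid or already used"

def handle_post(token):
    """State-changing step: only an explicit button press consumes the token."""
    if tokens.get(token) is False:
        tokens[token] = True  # mark consumed
        return 200, "Confirmed"
    return 410, "Link invalid or already used"
```

The point: a scanner following the emailed GET link never changes state, no matter how many times it fetches the page; only the explicit POST consumes the token.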
If your goal is to hide your secrets from Microsoft, then send the email encrypted or don't send it to Microsoft's servers at all. This is practically impossible, or at least impractical, in most cases. You can't control the hosting provider and software of your customers.
Exactly. From what I've heard today it sounds like most apps have an extra step between the email link and the login, usually a JS step, to check for bots.
I struggle to understand how private companies like mine are OK with MS reading all employee email and processing it through their AI. I get these daily creepy emails from MS saying that you said you would do this yesterday.. I have resorted to using burnernote.com, not to hide anything from my company but to hide it from MS who competes with us on some products. I guess burnernote.com will also not work anymore since it creates one-time links.
I mean, you can say that Twitter and Slack, for example, do it too. Any service that generates a preview of your links will crawl the URL you provide, whether it's secret (e.g. sent in a private message) or not. Very, very few will stop at the "og:image" tags and such, because why would they discard data about you?
I have observed that Twitter's bot hits links within seconds of them being tweeted. The traffic comes from several locations, not all Twitter ASNs. One interesting source is Apple: their bot/scanner hits soon after.
That's true of course. What's interesting to me is that they've decided to pay for this access and visit the links so quickly. It must be pretty expensive or hard to get if only around two-dozen companies pay for access to the data[0].
It might be that Apple is paying for the firehose as a data source to bootstrap its search engine. Don't they have one accessible via Siri already? (I don't follow Apple tech very closely.)
Just wondering how this might be an attack vector for fucking with Bing... I can think of a couple of avenues; the most elegant would be if the URL itself triggered something within the scanner/URL processor; next up would be the content of the target page attacking the Bing infrastructure. I'd guess the backend processing is sandboxed, but it seems like an interesting avenue that a malicious actor might explore.
I remember sending one-time-use URLs in emails to customers, and they would already have expired by the time the customers clicked them, because Outlook was opening them first.
Yeah yeah GET is idempotent and I shouldn't do that blah blah. That's not the point.
In the B2B SaaS where I work we started using single use codes to log in for certain account types (non-admins). No password. We send you an email or an SMS with a 6-digit number. Copy/paste it to log in. Very much like 2FA except there is no password. The session lasts 30 days. The user can disconnect of course.
Curious what HN readers think. Is this secure? Sufficient?
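A rough sketch of the issue/verify logic described above (Python; the store, TTL, and function names are illustrative assumptions, not the actual implementation):

```python
import hmac
import secrets
import time

CODE_TTL = 10 * 60  # assumed validity window: 10 minutes

pending = {}  # hypothetical store: email -> (code, issued_at)

def send_code(email):
    """Generate a 6-digit one-time login code (would be sent by email/SMS)."""
    code = f"{secrets.randbelow(1_000_000):06d}"
    pending[email] = (code, time.time())
    return code

def verify_code(email, attempt):
    """Check a submitted code: single use, time-limited, constant-time compare."""
    entry = pending.pop(email, None)  # pop makes the code single-use
    if entry is None:
        return False
    code, issued = entry
    if time.time() - issued > CODE_TTL:
        return False
    return hmac.compare_digest(code, attempt)
```

One caveat with any 6-digit scheme: the code space is only a million values, so a real implementation also needs rate limiting on verification attempts.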
Automatically triggered POST is not sufficient to keep the bots at bay.
They seem to be implying that only automatically triggered POSTs are acceptable, but that was also >10 years ago.
With the way things are going, it might be that any on-page confirmation buttons won't be sufficient to keep the bots at bay. Maybe it's time to fight back, check the user-agent, and serve the bots a CAPTCHA?
True, but for the last 30ish years nobody has cared about that opinion, because it is profitable to use the lowest common denominator, get shit done and call it a day.
Why would you think that? If M$ had the same position as google even more things would be closed source and more connected with law enforcement and less private.
I was naive to think so, as M$ hadn't made as many obviously malicious moves as $G recently.
But I forgot all those companies are there for money and for sure they will do whatever they can.
I've noticed Office 365 Safe Links makes an OPTIONS request, not a GET. So restricting the endpoint to GET, via the [HttpGet] attribute for example, may be a quick fix.
While I think a Turing check can easily solve the problem without much friction, this only increases my hatred for Outlook scanning. The worst part: to turn it off, you also have to turn off junk mail protection (well, you used to; it's been a while since I tried).
Now, having my private links indexed by Bing is a bit too much!? I sincerely hope OP is mistaken and Bingbot is actually the outlook scanner.
That is baffling logic. Sure, they think they know best and want to ignore the wishes of the owner of the web site. But why then respect a NOINDEX meta tag but not robots.txt?
Our company occasionally does "test phishes" to see how well people resist them. Every time, some of our most security minded engineers end up on the "clicked on the malicious link" lists, when all they did was forward the message to IT to report the phish. I'm wondering if the bingbot leak is the reason.
I sent an email with a unique link in it to my @Outlook.com account 6hrs ago, and there have been no visits to the link. The email is in my inbox (though I have not opened it).
Does this only happen on opening the email (in the Outlook web ui)?
Isn't this just the "link preview" feature, that is enabled by default in outlook?
Many email clients generate link previews so that they can display a thumbnail of the webpage. It would seem necessary to filter those referrers out on the validation link.
“Private links” should be covered by robots.txt. The only case I see this happening is for those “anyone with link” shares and those are easy to cover.
Anything private should ideally be put behind authentication. If that isn't possible, then robots.txt. Search engines are _meant_ to index everything that is publicly accessible and not blacklisted in robots.txt.
More likely it's scanning for vulnerabilities or generating previews. Magic links don't work anymore; therefore most services send a code or something that you have to enter on a generic page.
Not in the least surprised. By accepting the EULA and using the service free of monetary charge, you are instead paying with the contents of your emails and address book.
If it’s Outlook doing the dirty work then running your own mail server won’t get rid of this problem. Outlook is the client reading the links regardless of mail server.
Also, somewhat a different topic, people that run their own mail servers might also still use Outlook.
Exactly. I also noticed Bing had accessed some non microsoft tokens too. Even gmail accounts were affected. I assume some people have connected their gmail account to the outlook client?
tldr: In 2022, whether you are a paying customer or a free customer, YOU are the product and you will be squeezed for all you've got (not specific to Microsoft at all)
I don't think that's very surprising for most people, the real takeaway is that not only will Bing read your emails, but they may also index any links you send and serve them in search results.
Agree, in my experience storage buckets are always private by default, and you must take several specific steps to make them public, ignoring the very big warnings sprinkled in each confirmation page along the way.
Are there any cloud vendors that don't follow this approach?
Almost all of them.
A good example is a Dropbox link you send to someone. I could generate this link to a private file in my Dropbox, email it to you, and Bing may index it.
Which would be correct, considering that ownership implies full and complete right of dominion over said entity, which you simply don't have. You could be locked out of your account with no means of getting back access, you can delete your data but have no guarantees that the data has been deleted, Google may create 'derivative works' on your data (see the terms of use) without your permission or will provide data about your account to authorities, etc.. That is not ownership, that's renting.