I've had to deal with this with e-mail verification links and Auth0.
The user clicked the link after getting it in their mailbox but then Auth0 throws up an error page because the e-mail address has already been verified (by Outlook scanning).
The problem becomes worse if for some reason the mail ends up in the junk mail folder so the user thinks they've never received the mail but when you check it looks like the e-mail address was verified successfully.
That has caused a lot of annoying back and forth trying to figure out what the hell is going on. We ended up adding a custom page to handle e-mail validation so we could handle the situation where the user lands on the page and the address has already been verified. Super annoying.
Links like this are stupid regardless of Outlook's behaviour because they require a perfectly reliable client and network and user in a perfectly undisturbed flow. If I can't F5, if I double-click, if my mouse is wonky, my wifi is bad, my power goes out, my computer hangs, my DSL dies just after a click, if I accidentally close the tab.. there are any of a thousand reasons why abusing GET for a one-time-use page or redirect is horribly wrong.
It takes incredible arrogance to continue using them in order to "improve usability" given all the obvious and common cases where they completely destroy usability. The difficulty for a provider to verify they aren't sending you to a phishing or browser 0day page barely scratches the surface.
My father has insisted on doing this for over 20 years, but he doesn't know how to do it himself. I expect a password-reset phone call from him every 2 or 3 days, and have since 1998. Just recently someone from his bank's IT department called him directly about having reset his password over 500 times.
I'm not sure if he's still doing it but someone put together https://theuserisdrunk.com/ and https://theuserismymom.com/ a few years back... I wonder if you could do something similar here, given the level of absolute predictability that seems to be involved.
I sadly can't put my finger on what's so compelling about this, just that my "oh that person should talk to a UX team lead!" meter just went plink
Or "passwordless" login, and I love it. Not many people use password managers, and most will reuse passwords between websites (i.e., their bank and some random unsecured SaaS product). One-time emailed passwords are an easy way to avoid this problem and have a fairly secure site (mind you, it's only as secure as their email). You can layer 2FA on top of this too.
It's only annoying if the site is constantly timing you out so that every single visit you need to resend. Why not just use secure cookies to remember the user for say a week?
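A minimal sketch of that suggestion: issue a long-lived session cookie after the first magic-link login, so re-sending is only needed once a week. The cookie attributes are standard; the `session` name and the token handling are illustrative, and the token is assumed to be a random, server-stored session id.

```python
from datetime import timedelta

def remember_me_cookie(session_token: str, days: int = 7) -> str:
    """Build a Set-Cookie header value that keeps the user signed in
    for `days` days. Secure + HttpOnly keep the token off plain HTTP
    and out of reach of page scripts; SameSite=Lax limits CSRF."""
    max_age = int(timedelta(days=days).total_seconds())
    return (
        f"session={session_token}; Max-Age={max_age}; "
        "Secure; HttpOnly; SameSite=Lax; Path=/"
    )
```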
I had a similar case recently where I was getting the magic link in an email on my phone, and needed to copy it into Slack so I could click it on the laptop I wanted to actually log in on
This... was impossible to do, because by long-pressing on iOS to get the Copy prompt, iOS also goes ahead and opens a preview of the link next to it
Haha, I was in a restaurant and you paid through your phone. My browser updated so it closed between the thank you page and the payment click. State was lost so the thank you page was broken. The restaurant didn’t think I paid but my bank account said otherwise (this was a bank transfer via ideal, not credit card). Getting out of there without paying twice was entertaining.
>We ended up adding a custom page to handle e-mail validation so we could handle the situation where the user lands on the page and the address has already been verified.
That's a yikes from me! So I can sign up on your service as anyone with an Outlook account, without verification?
I'd assume the custom page has a random URL and requires entering the email address (requiring a match) or clicking a button. I've seen some account confirmation pages like that.
Discord let someone sign up with my gmail email address, sent an email verification link, and before I saw either the "welcome to Discord" or "please verify your email" links they'd already let the person in as me. I don't know if this is because of google crawling links from mail or some other kind of failure, but I wasn't pleased that Discord would let someone impersonate me.
Wow! I think you just figured out an issue I had while working at a previous company using Auth0, where the authentication token would expire before the user had actually gone there (so the user saw an error page when clicking), but on our side it looked like the user went there and dropped off immediately after. Maybe 1% of users complained about this, but we never found the root cause (we moved to our own authentication before we could figure it out). This has to have been why. Thanks for sharing this!
HTTP GET requests are supposed to be idempotent, meaning that calling a URL twice should not lead to any different result than calling it once. This is part of the HTTP standard.
So while I think what Outlook does here is wrong, what these webpages do is simply a bug that should be fixed and shows a lack of understanding of HTTP.
I think that's a bit too much. Nothing in that suggests they are breaking anything in the HTTP specification. You're right that GET requests have to be idempotent, but the exchange of the single-use code from the email for an API token is most likely behind a non-GET request (like POST). The HTTP server responds to GET requests with the static assets (HTML/CSS/JS), and the static assets include JavaScript that calls the POST endpoint for the exchange.
At least that's my guess. I agree it's a bug on their side, and they should fix it. But I think it's more of a UX issue than breaking the protocol.
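A hedged sketch of that split (using a plain HTML form submit rather than JS, but the idea is the same): the GET stays safe and a separate POST consumes the code. The endpoint names and the code value are made up for illustration.

```python
# In-memory stand-ins for a real token store.
used_codes: set[str] = set()
issued_codes: set[str] = {"abc123"}  # illustrative one-time code

def handle_get_verify(code: str) -> str:
    """Safe GET: returns HTML only, changes no state. A mail scanner
    fetching this URL burns nothing."""
    return ('<form method="POST" action="/verify">'
            f'<input type="hidden" name="code" value="{code}">'
            '<button>Confirm my address</button></form>')

def handle_post_verify(code: str) -> bool:
    """Non-idempotent work lives behind POST: consume the code once."""
    if code in issued_codes and code not in used_codes:
        used_codes.add(code)
        return True
    return False
```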
GET is supposed to be safe as well as idempotent though.
Request methods are considered "safe" if their defined semantics are essentially read-only; i.e., the client does not request, and does not expect, any state change on the origin server as a result of applying a safe method to a target resource. (RFC 7231, § 4.2.1)
I wouldn't go around saying others don't understand HTTP. Puts you in a very awkward position when you are wrong. Which, you are. Idempotency and safety are separate concepts in HTTP. But, you do you. Take the advice however you want.
IMHO it is clearly a wrong assumption on the side of any such sender. A verification link should have clear definite actions for the user receiving it:
- It's me, let me confirm my address
- I never signed up for this heap of diamonds
Whenever I (not even a bot) click or follow a link from my mailbox, by accident or on purpose, I don't expect that to validate an account for anyone else, but me, intentionally, using a password I know.
I had a customer who had some sort of software that followed the link in the email we sent (no big deal so far), and THEN would follow every link and button on that page.
We had a handy quick decline and accept button on there so they were auto declining things…
I didn’t hate email until I got into web development….
Yup. It was a pain. I have no idea who wrote that and thought it was a good idea…
It was a super basic web form too. Probably the most html markup standard thing we have. Nothing strange about it that could have triggered some sort of strange behavior.
Microsoft does this because they're security-scanning every link in every Outlook email for known phishing and malware attacks. If Bing has not seen the web page before and it's not in Bing's dangerous-web-page index, it first needs to scan/index the page to determine whether it's a phishing/malware page, before returning that verdict to Outlook so the email can be flagged as dangerous.
> Microsoft does this because they're security scanning / checking all links in every Outlook email for known phishing and malware attacks
The problem with that is that the logic is broken. Microsoft cannot possibly know all phishing sites, especially smaller ones. By obfuscating the link, the user can no longer verify it themselves without clicking, but Microsoft will say it's safe. So the user is left with a false sense of security and is worse off.
It only works for huge sites ( e.g. mytwitter.lol phishing for twitter and similar), but drastically lowers the chance of less high profile phishing being caught.
The problem with that is that the logic is broken. If 99.99% of phishing can be prevented this way, what problem do you have with it? Would you really catch that 0.01% that an automated system wouldn't?
You mean you don't verify calls to action via other information channels? Fairly regularly I get phishing emails that correctly spoof the crypto headers of major sites (e.g., because of a misconfigured mail service). If an email asks me to do something, it either doesn't get done or I cover my ass in as many ways as possible, no exceptions.
That isn't by itself an argument against a good automated system -- I definitely like not having to sift through most of that garbage, but catching the 0.01% should be a routine practice, not something that seems like an insurmountable burden.
Maybe you should implement a "feature" that serves a simple static HTML page <p>This webpage is safe.</p> to "bingbot" and serve the real page to everyone else.
I believe the recommended practice is to hover over the URL before clicking the link.
If you do so in Outlook, there will be a popup that shows "Original URL: XXX". This allows users to determine for themselves whether the link is safe.
We got some security courses about that too. Unfortunately, Outlook replaces all of them with some Safelink URL rewriting, so the only way left to find out if a link is scammy is clicking it.
It is in fact possible to extract a destination URL from a Safelink one without clicking it. For the full link this can be tedious, but identifying the domain can still be done quickly.
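For illustration, a small sketch of that extraction. It assumes the common Safelinks format where the destination sits URL-encoded in the `url` query parameter; if Microsoft changes the format, this breaks.

```python
from urllib.parse import urlparse, parse_qs

def safelink_destination(safelink: str) -> str:
    """Pull the original destination out of an Outlook Safelinks URL
    without visiting it. parse_qs already percent-decodes the value."""
    qs = parse_qs(urlparse(safelink).query)
    return qs.get("url", [""])[0]
```

From there, `urlparse(...).hostname` on the result gives you the domain to eyeball.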
For normal URLs, I agree. But in this case you have adversarial URLs. Suppose the scammer puts "http" and "www.google.com" in the URL parameters, while the real host is eight random characters dot some obscure TLD.
I don't trust myself enough to be 100% sure I can decode a URL-encoded misleading mess perfectly every time.
Scammers have already hidden URLs in the username part of a URL, like www.google.com.unholymessherethatscrollsoutoftheurlbar @ malignantdomainnotgoogle.blah
Microsoft offers this as a security product. It's impossible to know all links, but known ones can be blocked to limit future issues. Other enterprise email security products scan the links and follow all the redirects as well. Within an incredibly small amount of time after delivery, every link in an email is "clicked" by those products.
Cool, so we should just stop building and running almost anything in existence because it's not all-encompassing? That sounds like a suboptimal path forward.
But there is nothing to indicate either in the post or in the referenced SO thread that the URLs are published to the search results. They are visited by bingbot, that much seems confirmed, but there’s no example where one of these results shows up in the public search results.
I've put in a hard block for all crawlers on all pages. Works for my scenario I think. Hopefully they don't lie in their user agent. Then it's going to be really bad.
Yes, but there is no indication they are publishing it in the search results.
The original post is just complaining that the malware scanning is visiting the links.
They come to the following conclusion
>This effectively makes all one-time use links like login/pass-reset/etc useless.
Which we all know is not true because sites like onetimesecret.com allow for entering a separate password to prevent this sort of thing when it does happen.
It would be an interesting discussion to talk about what Microsoft's whitelisting process looks like, but the original article doesn't seem to understand what is going on well enough to drive the conversation in that direction.
All pages with one-click links should be marked noindex, follow or noindex, nofollow. Your SEO consultant (if you have one) should have advised you on this.
I am not saying this excuses the privacy violation but just suggesting there are things we can do...
Worse would be links that are private to the people who possess the URL, like a private video on YouTube or a private document in Google Docs. The security depends on the URL being secret. This would silently publish secret information.
If those pages have no proper meta tags or robots.txt, there's absolutely nothing wrong with this. Security by obscurity was never a good approach; from proxies to security scanners, there has always been software that crawls unassuming URLs and publishes the results somewhere, if only in a report to the admin.
If you can say for certain that the links being published are coming from the malware scanning, and not being taken from users' browser sessions that are using Microsoft Edge you should elaborate on this.
I would be pretty mortified if browsers were using user browser sessions to scan content and pass it to bingbot…? What about if you’re browsing something local? Or your bank account?
It is common for corporate email security appliances as well. URLs should not be used for authentication, and neither should email. I really want to pick the brains of people who work on these types of systems to see why they think otherwise.
Many people (most?) prefer to signup to services by email address. To do so, those email addresses must be verified. How would you verify it without sending them an email link?
You can verify the validity of an email like that, no issue there. Just don't use it as an authentication factor. Control over an email account should not trump passwords (what you know) or proper 2FA (what you have, typically; email can be 2FA like SMS, and like SMS it is not a good choice). If a person proves they control an email account, then ask them for additional info like secret questions or other information configured during registration.
I should not be able to take over your life because I compromised your phone which has sms, TOTP app and email.
It is not. You can initiate a password reset via email, but additional recovery controls like security questions should still be required. In an ideal world you have 2FA as well; if you reset that via email too, then it isn't actually 2FA, it's email-based 1FA with extra steps. If your 2FA has a separate recovery mechanism as well, that would be ideal. If it was my webapp, I would use hashes of 3 answers to user-chosen questions, hashed in the browser/client. It could be an object pairing as well: every user gets a list of 30 objects or so and they pick 3 pairs as a recovery combination.
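A sketch of hashing a recovery answer client-side, as that comment proposes. The normalization step (so "Fluffy " and "fluffy" match) and the per-user salt are my assumptions, not part of any particular product.

```python
import hashlib
import unicodedata

def recovery_hash(answer: str, user_salt: str) -> str:
    """Digest a security-question answer before it leaves the client,
    so the server stores and compares only hashes. NFKC-normalize,
    strip, and lowercase so trivial typing differences still match."""
    normalized = unicodedata.normalize("NFKC", answer).strip().lower()
    return hashlib.sha256((user_salt + normalized).encode()).hexdigest()
```

A real deployment would want a slow KDF (scrypt/argon2) rather than bare SHA-256, since answers are low-entropy.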
If this were security scanning, why does it identify itself as BingBot? Doesn't that just allow cloaking and offer an easy workaround for any adversary with a modicum of intelligence?
The HTTP GET method is safe and idempotent by specification. Visiting a webpage should not trigger password resets or any other actions by itself. If that's a problem then it's the site's fault for being defective.
You’re right that it was a bit of an oversight on my behalf, as the links were only generated after a verified human user action (signup) I had assumed the 1 time links to their email would be safe. But regardless of the link action, it shouldn’t be passing that data to Bingbot to crawl and (possibly) index in search engine results. Private email data should not be shared with search engine crawlers IMO.
So how do you implement a "one click unsubscribe" link in an email? They're on GET requests. You could use JavaScript on the resulting page to then trigger the unsubscribe but bots are now running JavaScript as well.
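One real answer to that question is RFC 8058 "one-click" unsubscribe: the sender advertises an unsubscribe URL in a header, and a supporting mail client sends a POST to it, so a scanner's GET unsubscribes nobody. The header names are from the RFC; the URL and token below are illustrative.

```python
def unsubscribe_headers(token: str) -> dict[str, str]:
    """Email headers for RFC 8058 one-click unsubscribe. The mail
    client POSTs the fixed body 'List-Unsubscribe=One-Click' to the
    List-Unsubscribe URL; a plain GET on that URL should do nothing."""
    return {
        "List-Unsubscribe": f"<https://example.com/unsub?token={token}>",
        "List-Unsubscribe-Post": "List-Unsubscribe=One-Click",
    }
```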
… but it doesn’t matter if the client is compromised, because all it would hurt is the user, right? If the client was compromised, it could just not send anything to your servers, or ignore the results, or …
Dunning-Kruger level of understanding: Never trust the client for anything ever, it's unreliable, everything must be off client.
Never mind the client is literally the interface into your system, so it being compromised is already game over for an application where the user is most vulnerable party you wanted to protect...
Deep understanding: Trusting the client requires a well thought out security model.
If the client is hacked in this case, they already have full control over what the user sees, they can cut out your remote check.
Maybe a good balance would be to hash the root of the URLs and compare those, or use fuzzy hashing on page contents, just so that the backend isn't getting a bunch of private urls that might accidentally get logged somewhere.
Trades detecting stuff hidden behind redirects for less liability on your backend, something to possibly consider depending on functional requirements.
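A sketch of the first variant suggested above: hash only the URL's origin before it goes to the scanning backend, so private paths and tokens never reach (or get accidentally logged by) the server. This is an illustration of the trade-off, not any real product's protocol.

```python
import hashlib
from urllib.parse import urlsplit

def origin_digest(url: str) -> str:
    """Reduce a URL to a digest of its scheme+host before transmission.
    Two URLs on the same origin hash identically, so the backend can
    match against a blocklist without ever seeing paths or tokens."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}".lower()
    return hashlib.sha256(origin.encode()).hexdigest()
```

The stated cost applies: anything distinguishable only by path, or hidden behind a redirect, is invisible to the backend.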
In this example you're right. For something like scanning a site for malicious content, on-device is not a bad approach. It decreases the amount of data sent to the server.
The client has a much bigger issue to worry about if the client-side malware scanning has been compromised. Malware could modify the UI/network calls such that your server-side scanning displays a positive result anyway.
You have to trust the client to display information to the user at some point. Link malware scanning that job can safely be delegated to the client. Authentication cannot.
I have observed this, but also found that BingBot modifies the query string parameters of your URL. It does this by changing a character of the URL, possibly in an attempt to find new pages?
I noticed this because I generate links with a signed token to ensure integrity, and started receiving invalid-token crash reports in Sentry, always from BingBot.
To fix this I had to move the tokens from the query string into the URL path itself, so BingBot couldn't change them.
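A hedged sketch of that fix, assuming an HMAC-signed token carried in the path rather than the query string. The path layout, secret handling, and domain are made up for illustration.

```python
import hashlib
import hmac

SECRET = b"server-side secret"  # illustrative; load from config in reality

def make_link(user_id: str) -> str:
    """Embed the signature in the URL path, since query parameters
    were observed being mutated by the bot."""
    sig = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"https://example.com/confirm/{user_id}/{sig}"

def verify_path(user_id: str, sig: str) -> bool:
    """Recompute and constant-time-compare; any mutation fails cleanly
    instead of crashing."""
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Note this only makes tampering detectable; a bot that fetches the unmodified link can still trigger whatever the URL does, which is the thread's larger point.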
I would guess that this is probably done on purpose to avoid tripping one-time-use links. Seems like a good way to hide malware from the scanner though.
I've finally found someone else who's seen this behaviour!
I've noticed this too, and I found (in my case anyway) that Bing/Outlook seems to Rot13 the keys of the query parameters - is this what you're seeing too?
Outlook will only send GET requests, which are supposed to be safe and idempotent unless you're ignoring the spec. A message saying "this code has already been used" after sending a GET request is a bug.
I don't see the problem here; all services need to do is add a page that says "welcome back, $Username, click here to log in!" which sends a POST request to do the actual confirmation, without breaking any specifications.
Microsoft claims the visiting bot is BingBot, but it's probably just the SmartScreen system checking for malicious links/downloads/etc., like many cloud-integrated security products do these days.
I can set my browser to pretend I'm BingBot, you can't derive anything meaningful from the user agent. Unless you find your secret URLs in Bing's search results, your secret links aren't actually being monitored by a search engine.
I’m fairly certain they are. My links ended up indexed in Bing search results. The only place they were ever rendered was in private emails to users. Bing should not be indexing that.
You're right, it shouldn't. It's possible that they're fetching these URLs from their customers' browsing history and submitting those (external submissions follow different crawling rules, sometimes bypassing robots.txt). Bing's webmaster information says so, at least: https://www.bing.com/webmasters/help/webmasters-guidelines-3...
For a bit of added "fun", Google will do the same, but if you add a page to robots.txt and set noindex, they won't process the noindex directive, and external indexing sources might still generate search results: https://developers.google.com/search/docs/advanced/crawling/...
That's a bit of a narrow view on this problem. When sending a link to someone, you expect that someone to view the link. Not some random mail service. Who gave the mail server permission to access the page? What if it contains copyrighted material? What if it's one of the millions of pages which don't follow the HTTP design philosophy to the letter?
> Who gave the mail server permission to access the page?
The recipient of the email, or their employer's IT department that is paying another company for mail services.
If you send me an email with a link then I do believe I have the right to send that link to a third party service that can validate that it's not malicious. If I decide to sign up for a mail service that promises to protect me from phishing emails, then I [0] expect said service to read the emails I receive and examine the links within them. I would be upset if the service used the info I share with them for purposes other than keeping me safe, though.
I readily admit that I have, at various points in my life, signed up for services without reading the entire TOS that I agreed to. I try to choose companies that I feel I can trust to not abuse me too much, and sometimes I avoid certain services because I don't trust the company behind them enough to respect my privacy.
[0] I acknowledge that not everyone is as knowledgeable as me, and many people might not realize that this is how the protection works. So if the argument is more education, I'm in favor.
> When sending a link to someone, you expect that someone to view the link
That sounds like a narrow view of email. This has never ever been true. Corporate firewalls have always opened links, and many users use tracking blockers in their email provider that automatically open incoming email and detect tracking cookies. I cannot stress strongly enough that you cannot rely on only one "user" clicking a link.
When sending a link to someone, you expect your antivirus, your email provider, your email provider's spam filter, any intermediate email providers, the recipient email provider's spam filter, the recipient's email provider, the recipient's antivirus, your recipient's mail client, your recipient, and any other people who the message will be forwarded to, to see the link and evaluate it. The email standard is pretty clear that any number of intermediate servers and services can and will be able to see what you're sending.
Email isn't WhatsApp, there are probably at least three or four parties who will scan the link in any way they like. If you control your side you can make sure there are only one or two parties scanning the email en route, but the number can never be guaranteed to be zero without workarounds.
Who gave the mail server permission to access the page? The person who set up email on the domain. If you don't trust the hostmaster, don't send email to that domain. What if it contains copyrighted materials? Well, you just shared a plaintext link with a whole bunch of people, depending on your local legislation you may be in trouble.
You can't even expect a link clicked once by a single user in a browser to only appear once on the server side. TCP connections get dropped and retried. This isn't some kind of philosophical interpretation of a mystical protocol spec, this happens in real life. If you use POST/PUT/whatever requests, the user agent will prompt the user if they really want to repeat a request; this protection has been built in for years. It's just how browsers work and how they've been working for decades.
If your recipient is behind a proxy, the link may be visited several times each hour for up to a month while the proxy refreshes its cache. This was a more prevalent problem back in the day, these days web proxies are mostly a thing of the past; however, proxies still exist, and if you don't pay attention to those things they will bite you in the ass.
In real life bugs happen. That's fine in these cases, web dev isn't exactly rocket science, bugs are tolerated and can be fixed. The bug here isn't the fact that links get visited twice, though: the bug here is that the developers who set up their magical links forgot about idempotency when they wrote their code, or they chose to ignore the problem because they never ran into it themselves. Either way, the responsibility to get it fixed isn't on anyone but the party violating the spec.
As a workaround, S/MIME or PGP should work around most of these problems as intermediate servers can't see what's going on. What the client's machine will do with the decrypted message is still up to interpretation, of course.
> Outlook will only send GET requests, which are idempotent unless you're ignoring the spec. A message saying "this code has already been used" after sending a GET request is a bug.
Alright, so imagine this: we have two endpoints, GET /page and POST /increment. Making a POST request to /increment increments a counter kept in memory and returns its new value. The GET /page endpoint returns an HTML file containing JavaScript code that, when executed, calls the /increment endpoint.
Are we now breaking the HTTP specification's requirement that GET requests be idempotent if we visit /page in our browser? I think not, but this is sometimes how pages are implemented, and robots are gonna have to deal with it, as otherwise many would consider them broken.
Don't get me wrong, I think it's a shitty implementation as well. But is it breaking the HTTP specification? Unlikely.
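The two endpoints from that comment, sketched as plain functions (no web framework, just the state split; the counter lives in module memory):

```python
counter = 0  # in-memory state, as in the comment above

def get_page() -> str:
    """GET /page: idempotent by itself — it only returns HTML. The JS
    it ships is what fires the POST, and only when a full browser
    actually executes it."""
    return '<script>fetch("/increment", {method: "POST"})</script>'

def post_increment() -> int:
    """POST /increment: the actual state change."""
    global counter
    counter += 1
    return counter
```

A crawler that fetches /page without running JS never touches the counter; one that does run JS (as Bing reportedly does) will, which is exactly the gray area being argued.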
I don't think it's breaking the spec per se, but web crawlers execute javascript and people hit reload on their browsers, sometimes accidentally. Automating this process may not be the solution here.
Personally, I think web crawlers like Bing shouldn't be executing JavaScript at all, but frontend developers can't go without their client-side rendering frameworks, so search engines are more or less forced to.
As for a security mechanism, you want to emulate a browser as closely as possible to detect tricks like redirects from safe domains to attack domains and obfuscated URL crap. I'd expect any automated, non-interactive code to execute in a security analysis sandbox.
Is this breaking the standard? Who knows. What is a cloud antivirus but a web user agent running in a data center? The email protocol doesn't specify how the client should deal with links, the robots.txt only works for spiders, not for manually submitted URLs like those clicked in emails, and without a noindex tag you're going to see your page indexed by the mail provider company regardless of what your robots file says.
I think in theory your solution solves the spec breaking problem, but it doesn't solve the problem in practice because there are many other components for which there are no standards and defensive programming is required.
From the spec: although it doesn't explicitly say "don't automatically send a POST upon opening a GET", I think it's fairly clear that doing so is against the spirit of what a GET should represent to the user, since the POST is not a safe request.
"In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.
Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them."
Even if Microsoft claim this is about security scanning, isn't it fairly trivial to configure your webserver to serve up different content depending on the User-Agent request header?
BingBot scans the link, gets a dummy page with 'clean' content, Microsoft delivers the email message to the user, user clicks through the link with actual browser, gets phishing / malware content...
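The evasion described above is trivially expressible in code, which is the point of the objection. This is a sketch of why User-Agent-based scanning is weak, not something to deploy; the strings are illustrative.

```python
def respond(user_agent: str) -> str:
    """Cloaking: a scanner that announces itself gets a clean dummy
    page, while everyone else gets the real (here hypothetical)
    content. Real scanners mitigate this with unmarked user agents
    and varied source IPs."""
    if "bingbot" in user_agent.lower():
        return "<p>This webpage is safe.</p>"
    return "<p>Actual page content here.</p>"
```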
Isn't the quick fix that you arrive at some page and there have to press a button or load some JS to perform the action? AFAIK most email providers (like Gmail) will also visit links, therefore you shouldn't perform actions directly on the GET request. For instance, if you have an unsubscribe link and all it takes is visiting the address, most of your subscribers will be accidentally unsubscribed.
Same if you paste a link in Slack/FB/Discord/Twitter whatever, they will visit the page to create a preview. GET requests shouldn't have side effects.
> AFAIK most email providers (like Gmail) will also visit links.
I keep hearing this, but our newsletter system has been using GET unsubscribe links since at least 2007 (but probably longer), and we never found a wave of Gmail users unsubscribing, we still have a lot of them. I wonder if this is simply an urban legend, if Gmail tries to recognize unsubscribe links, or if there is something else going on.
We do not :) It was never an issue there. Bots get ignored for stats, Google IPs get blocked for Ads (Google seems to think every ad link has to be visited by a ton of bots, our customers actually started complaining about the traffic)
(I made a lot of changes today when testing all, including "visit as Bingbot" from their webmaster tools with and without the URL blocked by robots.txt)
> was that Bing actually indexed them. (even though my robots.txt said not to)
Never mind indexing them (ie publishing them at Bing.com), if URLs are disallowed in robots.txt then Bing shouldn't even be retrieving them, even if only to scan the content for malware!
This is a common misconception about robots.txt. It tells bots what they should do while directly crawling your site. But if a search engine gets to a URL some other way—for example if it follows a link from somewhere outside your site—it will still index that page.
Robots.txt is not a reliable way to exclude pages from search engine indexes. That is not what it is for. It is for controlling crawler behavior.
The only reliable way to exclude a URL from a search engine index is to serve “noindex” on that URL, either with a metatag or an HTTP header, or both.
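A small sketch of serving both forms of noindex at once, per the advice above. The helper and body are illustrative; the `X-Robots-Tag` header and `<meta name="robots">` tag are the standard mechanisms.

```python
def noindex_response(body: str) -> tuple[dict[str, str], str]:
    """Attach noindex both as an HTTP header and as a meta tag, so
    the URL stays out of indexes even when a crawler reached it via
    an external link (which robots.txt alone does not prevent)."""
    headers = {
        "X-Robots-Tag": "noindex",
        "Content-Type": "text/html; charset=utf-8",
    }
    html = (
        '<html><head><meta name="robots" content="noindex"></head>'
        f"<body>{body}</body></html>"
    )
    return headers, html
```

Important caveat from the rest of the thread: the crawler must be *allowed* to fetch the page to see either signal, so don't also disallow it in robots.txt.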
> It tells bots what they should do while directly crawling your site. But if a search engine gets to a URL some other way—for example if it follows a link from somewhere outside your site—it will still index that page.
I must confess I've been sceptical of robots.txt for a very long time (if I want to stop bots I serve them HTTP 403 Forbidden using .htaccess or similar).
Be that as it may, it appears I'm also confused about what robots.txt does and doesn't do.
Assuming you're correct: let's say I run EvilBot which scrapes sites and want to scrape your site example.com, but your robots.txt only allows Googlebot and disallows everyone else. Am I really OK to:
1. scrape the SERPs from google.com which mention your site ("site:example.com")
then
2. using that list of URIs, use my EvilBot to scrape your site, without needing to touch or respect your robots.txt, since I got the list of URIs on your site from Google, not by scraping example.com directly?
Your step 1 is enough for URLs to be indexed. Even a well-behaved search engine does not need to visit your site to index a URL, including whatever anchor text pointed at it.
If the crawler does then visit your site, it will see your robots.txt and (if well-behaved) obey it and not crawl the contents of the page at that URL. But this does not mean it will remove the URL itself from its index.
Again: robots.txt is intended to control crawler behavior, not search index visibility.
Google's page is a pretty good overview of this distinction:
> Again: robots.txt is intended to control crawler behavior, not search index visibility.
I'm obviously not asking the question clearly, I'm wanting to stop bots from crawling (it's scraping that annoys me), not search engines from listing URIs.
If I want to completely stop a bot from crawling my site (in the sense of "retrieving my content"), won't robots.txt prevent that? Even in the case of the bot having obtained a valid list of my URIs, but not the pages' contents, from a 3rd-party source?
Let's say I email you a list of URIs on my site. My robots.txt forbids all crawlers. Are you allowed to give the list of URIs to your bot and retrieve the content?
Yes, but with login tokens the Bing bot would be able to log in nonetheless. The login URL would probably also end up in some logs at Microsoft or an antivirus vendor. It sounds paranoid, but the URL is basically a cleartext password lying around.
Sounds like anyone dealing with any sort of vaguely sensitive information through email, and certainly any corporation, should avoid using Outlook for anything.
The article is about email verification links, which is a pretty clear case where this can be dangerous, but tons of other links can get emailed without being intended for a wider audience.
Besides, the fact that Outlook shares anything related to the content of your email with the outside world is just completely unacceptable.
(Should private links be sent over unencrypted email? Probably not. But lots of stuff gets emailed that's not super secret and yet also not meant to be shared outside the company.)
There's a difference between adding the URLs to search engine results and accessing the URLs to scan for malware. The latter is quite common, lots of email hosts do that. It's not clear to me from the post if the former is actually happening - the author doesn't state that they found the links in Bing's results, just that they were accessed by BingBot.
Are the URLs being served with a “noindex” header? Blocking crawls with robots.txt cannot de-list items from Google or other search engines.
> Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results. If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.
> Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results.
I realize that you are just the messenger and not the progenitor of that policy, so not addressing this to you, but: that is ridiculous. robots.txt is basically useless.
Robots.txt is a mechanism for providing instructions to automated crawlers but I don't think they've ever been promised to be used when a URL is manually or automatically submitted through other means (i.e. another site linking to yours). In those cases, a single page will probably be crawled, but the rest of the domain probably won't.
They are now - I didn't think I had to, as all the pages are naturally behind a login; it never crossed my mind that Bing would follow email links, let alone index them in search results.
That should be clearer in the article. It's not evident from the SO question or from what you've actually written. A screenshot of those search results would go a long way towards making the article sound more credible.
Could it be that they were taken from MS Edge history? I mean, still amazingly bad, but just throwing it out there. It could explain the Gmail ones as well.
To all the folks that suggest preventing opening of single-use links by robots.txt or user-agent detection etc: Just don't. There are dozens of tools at use throughout the various stages of an email with URLs being delivered that will go out and fetch websites. You have to design any confirmation dialog so the user still has to click a button to confirm, otherwise any one of these tools might inadvertently trigger your confirmation.
Office 365 just seems to make links useless for security now.
Our 365 instance now turns every link into a massive monolith of Safe Links checking URLs through Microsoft, making it impossible to tell whether any email is a phishing attempt without pasting the link into one of many online 'decoders'...
It's not just the free client. My university uses O365 (?) and the links are rewritten in emails even when read in other clients (Mail.app on iOS/macOS).
Admin has also turned on "You don't often get email from __" warnings that edit the email, so the warning gets included in replies. Very useful when you get a new large cohort of student email correspondents each semester :(
Exactly what I was thinking - what really worries me is that it also seems to have happened to a lot of @gmail users. I still can't figure out how Bing managed to find email tokens sent to Gmail. Maybe users who connected their Gmail account to the Outlook client...?
The HTTP GET method is meant to be safe and idempotent: it should cause no side effects and behave the same way on multiple accesses.
A single use link, e.g. for resetting a password or confirming a subscription, will usually show a webpage with a form that does a POST. Once that POST has been performed, the single use link is used up.
Single use links will mostly have a one-time secret that should not be leaked. Mails that contain such links or any sensitive information should be encrypted.
If your goal is to prevent third party software like spam filters and malware engines from triggering actions, you must require a second step that will send a POST/PUT/anything-that-isn't-idempotent request. You can copy the authentication code into a form field and do the entire thing without Javascript if you want to, but a second step is necessary.
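A minimal sketch of that second step in Python (the function names and in-memory token store are made up for illustration; a real app would use a web framework and a database):

```python
import secrets

# Hypothetical in-memory store: token -> consumed flag.
tokens = {}

def issue_token():
    """Create a single-use token, e.g. to embed in an emailed confirmation link."""
    token = secrets.token_urlsafe(32)
    tokens[token] = False
    return token

def handle_get(token):
    """Safe step: the emailed link (and any scanner that follows it) only gets a form."""
    if tokens.get(token) is False:
        return 200, '<form method="POST"><button>Confirm</button></form>'
    return 410, "Link invalid or already used"

def handle_post(token):
    """State-changing step: only an explicit button press consumes the token."""
    if tokens.get(token) is False:
        tokens[token] = True  # mark consumed
        return 200, "Confirmed"
    return 410, "Link invalid or already used"
```

The point: a scanner following the emailed GET link never changes state, no matter how many times it fetches the page; only the explicit POST consumes the token.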
If your goal is to hide your secrets from Microsoft, then send the email encrypted or don't send it to Microsoft's servers at all. This is practically impossible, or at least impractical, in most cases. You can't control the hosting provider and software of your customers.
Exactly. From what I've heard today it sounds like most apps have an extra step between the email link and the login, usually a JS step, to check for bots.
I struggle to understand how private companies like mine are OK with MS reading all employee email and processing it through their AI. I get these daily creepy emails from MS saying that you said you would do this yesterday.. I have resorted to using burnernote.com, not to hide anything from my company but to hide it from MS who competes with us on some products. I guess burnernote.com will also not work anymore since it creates one-time links.
I mean, you can say that Twitter and Slack, for example, do it too. Any service that generates a preview of your links will crawl the URL you provide, whether it's secret (e.g. sent in a private message) or not. Very, very few will stop at the "og:image" tags and such, because why would they discard data about you?
I have observed that Twitter's bot hits links within seconds of them being tweeted. The traffic comes from several locations, not all Twitter ASNs. One interesting source is Apple: their bot/scanner hits soon after.
That's true of course. What's interesting to me is that they've decided to pay for this access and visit the links so quickly. It must be pretty expensive or hard to get if only around two-dozen companies pay for access to the data[0].
It might be that Apple is paying for the firehose as a data source to bootstrap its search engine. Don't they have one accessible via Siri already? (I don't follow Apple tech very closely.)
Just wondering how this might be an attack vector for fucking with Bing... I can think of a couple of avenues; the most elegant would be if the URL itself triggered something within the scanner/URL processor; next up would be the content of the target page attacking the Bing infrastructure. I'd guess the backend processing is sandboxed, but it seems like an interesting avenue that a malicious actor might explore.
I remember sending one-time-use URLs in emails to customers, and they would already have expired by the time the customers clicked them, because Outlook was opening them first.
Yeah yeah GET is idempotent and I shouldn't do that blah blah. That's not the point.
In the B2B SaaS where I work we started using single use codes to log in for certain account types (non-admins). No password. We send you an email or an SMS with a 6-digit number. Copy/paste it to log in. Very much like 2FA except there is no password. The session lasts 30 days. The user can disconnect of course.
Curious what HN readers think. Is this secure? Sufficient?
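A rough sketch of the issue/verify logic described above (Python; the store, TTL, and function names are illustrative assumptions, not the actual implementation):

```python
import hmac
import secrets
import time

CODE_TTL = 10 * 60  # assumed validity window: 10 minutes

pending = {}  # hypothetical store: email -> (code, issued_at)

def send_code(email):
    """Generate a 6-digit one-time login code (would be sent by email/SMS)."""
    code = f"{secrets.randbelow(1_000_000):06d}"
    pending[email] = (code, time.time())
    return code

def verify_code(email, attempt):
    """Check a submitted code: single use, time-limited, constant-time compare."""
    entry = pending.pop(email, None)  # pop makes the code single-use
    if entry is None:
        return False
    code, issued = entry
    if time.time() - issued > CODE_TTL:
        return False
    return hmac.compare_digest(code, attempt)
```

One caveat with any 6-digit scheme: the code space is only a million values, so a real implementation also needs rate limiting on verification attempts.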
Automatically triggered POST is not sufficient to keep the bots at bay.
They seem to be implying that only automatically triggered POSTs are acceptable, but that was also >10 years ago.
With the way things are going, it might be that any on-page confirmation buttons won't be sufficient to keep the bots at bay. Maybe it's time to fight back, check the user-agent, and serve the bots a CAPTCHA?
True, but for the last 30ish years nobody has cared about that opinion, because it is profitable to use the lowest common denominator, get shit done and call it a day.
Why would you think that? If M$ had the same position as google even more things would be closed source and more connected with law enforcement and less private.
I was naive to think so, as M$ hadn't made as many obviously malicious moves as $G recently.
But I forgot all those companies are there for money and for sure they will do whatever they can.
I've noticed Office 365 Safe Links makes an OPTIONS request, not a GET. So restricting the endpoint to GET, via the [HttpGet] attribute for example, may be a quick fix.
While I think a Turing check can easily solve the problem without much friction, this only increases my hatred for Outlook scanning. The worst part: to turn it off, you also have to turn off junk mail protection (well, you used to; it's been a while since I tried).
Now, having my private links indexed by Bing is a bit too much!? I sincerely hope OP is mistaken and Bingbot is actually the outlook scanner.
That is baffling logic. Sure, they think they know best and want to ignore the wishes of the owner of the web site. But why then respect a NOINDEX meta tag but not robots.txt?
Our company occasionally does "test phishes" to see how well people resist them. Every time, some of our most security minded engineers end up on the "clicked on the malicious link" lists, when all they did was forward the message to IT to report the phish. I'm wondering if the bingbot leak is the reason.
I sent an email with a unique link in it to my @Outlook.com account 6hrs ago, and there have been no visits to the link. The email is in my inbox (though I have not opened it).
Does this only happen on opening the email (in the Outlook web ui)?
Isn't this just the "link preview" feature, that is enabled by default in outlook?
Many email clients generate link previews so that they can display a thumbnail of the webpage. It would seem necessary to filter those referrers out on the validation link.
“Private links” should be covered by robots.txt. The only case I see this happening is for those “anyone with link” shares and those are easy to cover.
Anything private should ideally be put behind authentication. If that isn't possible, then robots.txt. Search engines are _meant_ to index everything that is publicly accessible and not blacklisted in robots.txt.
More likely it's scanning for vulnerabilities or generating previews. Magic links don't work anymore; therefore most services send a code or something that you have to enter on a generic page.
Not in the least surprised. By accepting the EULA and using the service free of monetary charge, you are instead paying with the contents of your emails and address book.
If it’s Outlook doing the dirty work then running your own mail server won’t get rid of this problem. Outlook is the client reading the links regardless of mail server.
Also, somewhat a different topic, people that run their own mail servers might also still use Outlook.
Exactly. I also noticed Bing had accessed some non microsoft tokens too. Even gmail accounts were affected. I assume some people have connected their gmail account to the outlook client?
tldr: In 2022, whether you are a paying customer or a free customer, YOU are the product and you will be squeezed for all you've got (not specific to Microsoft at all)
I don't think that's very surprising for most people, the real takeaway is that not only will Bing read your emails, but they may also index any links you send and serve them in search results.
Agree, in my experience storage buckets are always private by default, and you must take several specific steps to make them public, ignoring the very big warnings sprinkled in each confirmation page along the way.
Are there any cloud vendors that don't follow this approach?
Almost all of them.
A good example is a Dropbox link you send to someone. I could generate this link to a private file in my Dropbox, email it to you, and Bing may index it.
Which would be correct, considering that ownership implies full and complete right of dominion over said entity, which you simply don't have. You could be locked out of your account with no means of getting back access, you can delete your data but have no guarantees that the data has been deleted, Google may create 'derivative works' on your data (see the terms of use) without your permission or will provide data about your account to authorities, etc.. That is not ownership, that's renting.