G’day,
I’m on a 100/40Mbps HFC plan, and have an ongoing issue where the internet in general (browsing, file access, email, cloud hosted products, etc.) will just grind to a halt.
For example, I can be accessing our CMMS and suddenly a page load will take 30+ seconds to complete. Never times out, just takes forever. Or I’m using our accounting software which syncs remotely and saving an invoice or opening a purchase order will stall for a minute.
This behaviour goes on for maybe 5 minutes or so and then goes away again. It can occur once or twice in a 10 hour day at the office, or not at all, or sometimes half a dozen times in a one hour period.
- Local network use is unaffected (e.g. accessing SMB shares on a local server)
- All PCs and laptops connected to the LAN are affected, so it’s not PC-specific.
- Ping is unaffected and hovers around 12ms to geographically close remote servers, with no packet loss or jitter.
- Speedtests of any kind always return around 95/35Mbps at any time be it peak / off peak / when problem is occurring / when problem is not occurring
- VOIP does not seem to be affected despite being on the same network and I can talk on the phone while the internet is otherwise wading its way through treacle.
- Happens with my current ISP (Leaptel), but also happened with the previous ISP (Aussie Broadband), which is a completely different company and, I believe, uses completely different peering/routing/backhaul/etc.
- DNS seems irrelevant: the problem occurs whether I use the ISP DNS, Cloudflare, Google, or Quad9
- Some websites like Facebook and Google work, but other websites like Lemmy (any instance), Reddit, my CMMS, various wholesaler sites hosted both in AU and worldwide, are affected.
Are there any steps I can take to try and identify what causes this random delay? It’s just enough to be really frustrating, especially when you’re trying to look up something while on the phone and have to be like “so yeah how’s the wife? how’s the kids? how’s the…dog? … pet bird doing anything interesting?” as you wait for a damn page to load. I need fast internet so I don’t need to make small talk dammit.
PCs are all on cat5e or cat6 (depending on when the cabling was run), to a Ubiquiti Dream Machine SE which is connected via cat6 to the NBN HFC modem.
unplug each of the tubes and blow in them. reconnect
First thought was ISP intermittent packet loss, but
“Ping is unaffected”
“no packet loss or jitter”
“Speedtests of any kind always return when problem is occurring / when problem is not occurring”
suggests otherwise. My second thought was DNS crashing, but
“DNS seems irrelevant”
You already got it covered.
“Some websites like Facebook and Google work”
“VOIP does not seem to be affected”
Really weird situation! Try using Wireshark to listen on the interface and observe what’s happening. Are packets going out but none returning? Are they returning with errors? Retransmissions? Are some destinations fine but others get no reply?
Could you geographically locate the IPs that work vs. the IPs that don’t? My next suspicion is that there is some upstream backbone link that cuts out, so stuff with local CDNs like Facebook continues to work, but a Lemmy server on another continent is unreachable. Try traceroute.
Unplug everything from the router. If the router has a speed test built-in, use it and see the baseline performance. Start plugging stuff in and see when it starts going to shit.
I’ll share my input, although it’s primarily speculation and a smidge of deductive reasoning.
Given these three particular pieces of information:
- Local network use is unaffected (e.g. accessing SMB shares on a local server)
- Happens with my current ISP (Leaptel), but also happened with the previous ISP (Aussie Broadband)
- Ping is unaffected and hovers around 12ms to geographically close remote servers, with no packet loss or jitter.
My first instinct is that the issue may be upstream (non-local) network congestion, since it appears that connections are slowing to a crawl rather than dropping packets. Ping requests don’t seem to suffer, but they’re a lot smaller than loading content via CMMS, Reddit, etc. You mentioned it could happen twice or more in a 10 hour shift, or sometimes not at all; network congestion being highly variable could explain this.
Are you in a remote area? If so, there may not be much nearby infrastructure (routers) to handle the big spikes in traffic when everyone in the immediate area clocks in to work at 9am, or gets back from lunch around 1pm, etc. If that’s the case, the local routers would get overwhelmed regularly by congestion and packet delivery times would suffer. This could also happen in more densely populated areas, depending on what the local infrastructure looks like.
Though I’m not entirely sure how to explain speed tests not suffering if congestion is the issue; unless the particular routes to the geographically-close test servers aren’t congested (because large numbers of people are trying to connect to real services, not the speed tests, during these congestion times).
The fact that some live services like Google & Facebook load while others like Reddit and Lemmy do not could be explained by the difference in those services’ respective high-availability (HA) solutions. Facebook and Google don’t typically drop below 99.95%-ish uptime because they scale their server infrastructure very aggressively to meet demand. But even huge services like Reddit have considerably more downtime than Facebook or Google (Reddit seems to have major outages several times a year, while Google and Facebook do not). Some upstream services having more servers to handle more requests more quickly could account for the inconsistent ability to load websites during this congestion.
I’m not sure the best way to test this hypothesis, though. Given how much troubleshooting and information gathering you’ve already done, this is a tricky one.
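One rough way to gather evidence for or against the congestion idea is to log page fetch times all day for one site that keeps working and one that stalls, then see whether the slow episodes line up with time of day. Below is a minimal Python sketch of that. The URLs in `TARGETS`, the 5 second threshold, and the file name `latency_log.csv` are all placeholders I made up; swap in your own.

```python
import csv
import time
import urllib.error
import urllib.request
from datetime import datetime

# Hypothetical targets: replace with sites that work / stall for you.
TARGETS = {
    "works": ["https://www.google.com/"],
    "stalls": ["https://www.reddit.com/"],
}


def fetch_seconds(url, timeout=60):
    """Return seconds taken to fetch url, or None if the request failed."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except urllib.error.HTTPError:
        pass  # an HTTP error status is still a completed round trip
    except Exception:
        return None  # DNS failure, timeout, reset, etc.
    return time.perf_counter() - start


def find_episodes(samples, threshold=5.0):
    """Group consecutive slow samples into (first_ts, last_ts) episodes.

    samples: list of (timestamp, seconds_or_None) in time order.
    A sample counts as slow if it failed or took longer than threshold.
    """
    episodes, start, last = [], None, None
    for ts, secs in samples:
        if secs is None or secs > threshold:
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            episodes.append((start, last))
            start = None
    if start is not None:
        episodes.append((start, last))
    return episodes


def log_forever(path="latency_log.csv", interval=30):
    """Append one row per target every `interval` seconds until interrupted."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            for group, urls in TARGETS.items():
                for url in urls:
                    stamp = datetime.now().isoformat(timespec="seconds")
                    writer.writerow([stamp, group, url, fetch_seconds(url)])
            f.flush()
            time.sleep(interval)
```

Leave `log_forever()` running on one wired PC for a few days. If the "stalls" rows spike at the same clock times every day while the "works" rows stay flat, that leans toward congestion on particular upstream routes; if the episodes are scattered at random, it leans toward something local.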
Some websites like Facebook and Google work, but other websites like Lemmy (any instance), Reddit, my CMMS, various wholesaler sites hosted both in AU and worldwide, are affected.
I wonder if IPv4 is somehow wonky, but IPv6 is working fine? Since Facebook and Google definitely support IPv6, the others may not (although Reddit should too).
You could try comparing `ping -4` and `ping -6` when it happens. That is, if your network supports IPv6.
If you do get any inconsistencies with ping, you could also try experimenting with `traceroute`/`tracert` to see where the delay happens.
If you have a good router, you can usually monitor all the traffic coming and going through the whole network. Either there will already be a panel in the router’s homepage for it or you can flash DD-WRT (or similar firmware) to it.
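Since ping already looks healthy, it may be worth repeating the IPv4 vs. IPv6 comparison with real TCP connections rather than ICMP. A minimal Python sketch, assuming port 443 and five attempts per family (both arbitrary choices of mine), could look like this:

```python
import socket
import statistics
import time


def connect_seconds(host, family, port=443, timeout=10):
    """Time one TCP connect to host using only the given address family.

    Returns None if the host has no address in that family or the
    connection attempt fails or times out.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return None
    if not infos:
        return None
    fam, socktype, proto, _, addr = infos[0]
    start = time.perf_counter()
    try:
        with socket.socket(fam, socktype, proto) as s:
            s.settimeout(timeout)
            s.connect(addr)
    except OSError:
        return None
    return time.perf_counter() - start


def summarize(times):
    """Count successes and failures and take the median of the successes."""
    ok = [t for t in times if t is not None]
    return {
        "ok": len(ok),
        "failed": len(times) - len(ok),
        "median": statistics.median(ok) if ok else None,
    }


def compare_families(host, attempts=5):
    """Return a summary of connect times per address family for one host."""
    results = {}
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        times = [connect_seconds(host, family) for _ in range(attempts)]
        results[label] = summarize(times)
    return results
```

Run `compare_families("www.reddit.com")` and `compare_families("www.google.com")` while the slowdown is happening. If one family shows failures or a much higher median than the other, that points at the IPv4/IPv6 theory; if both look the same, it is probably something else.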
Sounds like it’s a protocol level problem. Are you sure it’s not DNS, as in you’ve verified DNS responds in a reasonable time frame when it’s happening? Other HTTP requests complete normally?
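To check both of those questions at once, you can time each phase of a single HTTPS request separately. This is only a sketch: the phase names, the plain HTTP/1.1 request, and the 60 second timeout are my own assumptions, and it raises an exception on any failure rather than handling it.

```python
import socket
import ssl
import time


def time_phases(host, path="/", port=443, timeout=60):
    """Return seconds spent in each phase of one HTTPS GET request."""
    phases = {}

    t0 = time.perf_counter()
    addr = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4]
    t1 = time.perf_counter()
    phases["dns"] = t1 - t0

    # addr[:2] is (ip, port) for both IPv4 and IPv6 results.
    sock = socket.create_connection(addr[:2], timeout=timeout)
    t2 = time.perf_counter()
    phases["tcp_connect"] = t2 - t1

    ctx = ssl.create_default_context()
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        t3 = time.perf_counter()
        phases["tls_handshake"] = t3 - t2

        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        tls.sendall(request.encode("ascii"))
        tls.recv(1)  # blocks until the first response byte arrives
        t4 = time.perf_counter()
        phases["first_byte"] = t4 - t3

        while tls.recv(65536):  # drain the rest of the response
            pass
        phases["download"] = time.perf_counter() - t4

    return phases


def slowest_phase(phases):
    """Return the name of the phase that took the longest."""
    return max(phases, key=phases.get)
```

Calling `time_phases("www.reddit.com")` during a bad patch and again when things are fine shows where the 30 seconds goes. A big `dns` number means it is DNS after all; a big `tcp_connect` or `tls_handshake` suggests packets are being dropped early in the connection; a big `first_byte` or `download` with the rest normal points further upstream or at larger packets.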
See if you can replicate on a VPN.
Since the same behavior happens on two ISPs it’s probably something local. Get a packet trace at the router for the most sensitive protocol and see what’s happening.
Wild speculation: some local packets are getting dumped sometimes and the slow-to-recover protocols give you the timeout behaviors.
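To put rough numbers on "slow to recover": TCP retransmits a lost segment after a retransmission timeout (RTO) and doubles that timeout after each further loss. The sketch below assumes the RFC 6298 defaults of a 1 second starting RTO and a 60 second cap. Real stacks differ (many use a shorter timeout once they have measured the round-trip time), so treat the figures as illustrative only.

```python
def stall_after_losses(losses, initial_rto=1.0, max_rto=60.0):
    """Total seconds a sender waits if the same segment is lost
    `losses` times in a row.

    Assumes RFC 6298 style behaviour: start at initial_rto, double
    after every timeout, never exceed max_rto.
    """
    total, rto = 0.0, initial_rto
    for _ in range(losses):
        total += rto
        rto = min(rto * 2, max_rto)
    return total


# Cumulative stall for 1 to 6 consecutive losses of the same segment.
for n in range(1, 7):
    print(n, stall_after_losses(n))
```

Under those assumptions, five consecutive losses of the same segment add up to 31 seconds of waiting, which is in the same ballpark as a page that takes 30+ seconds but never actually times out.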