Diagnosis of Cloud Latency for Cloud or Hybrid Clients - Additional Partner Information

Problem
 
The following does not apply to Firewall Redirection / I-series / IPSEC traffic routing.
 
Based on experience, the likely problems are, in order of likelihood:
  • Firewall interruption
  • Client is using the wrong cluster
  • Client has some issue with local DNS
  • There is a problem between the client and the cluster proxy server
 
This article is intended to help you work out whether one of these issues applies to the case, and it can also be used to ask the customer to help define the problem. There are exceptions, but the following is based on collective experience.
 
  • In this article, "site" means the origin server and "location" means the client's actual location (not the customer account location).
  • Where COPs is referenced, this is the internal Cloud Operations group, which manages the clusters.
  • Where Backline is referenced, this is the part of Technical Support that escalates cases to Cloud Operations / Development Engineering.

For our Partners, please work with our Technical Support Team for those sections where 'templates' are referred to as these are internal configuration steps. Also check the Forcepoint Status page for any current known issues.
 
Resolution
Q: Is there a Firewall between the users and the Cloud clusters?
Latency while at a non-filtered location for Hybrid, or in an office location for Cloud, can sometimes be caused by a firewall. 
 
Diagnostic:
  1. Take a Packet Capture using Wireshark from the client machine. Check for any intermediate connections happening before reaching the cloud, such as a firewall.
  2. Check for "Ignored unknown Records" where the Cloud cluster or datacenter IP is showing. If present, this indicates the firewall may be doing deep packet inspection on HTTPS requests. In the image below, the result was from the Cloud cluster to the target machine.
[Image: Wireshark capture showing "Ignored unknown Records" entries for the Cloud cluster IP]
Fix:
To correct this, have the customer disable deep packet inspection. At the very least, disable packet inspection for traffic going to or coming from any of the Cloud data center IPs. For a list, see Cloud service Data Center (cluster) IP addresses and port numbers.

Q: Am I going to the nearest cluster / using the 'wrong' cluster?
If traffic goes to a cluster that is geographically farther away than expected, websites will load slowly.
 
Diagnostic:
  1. On an endpoint machine, browse to http://query.webdefence.global.blackspider.com/?with=all
  2. Using the IP addresses given in step 1, check the cluster on the map (Cloud service Data Center (cluster) IP addresses and port numbers) and the location of the external IP address (http://iplocation.net/).
  3. Check the client's selected proxy as follows:
    1. If the customer is using GeoDNS:
    2. If the customer is using GeoIP, then put the external (egress) IP in the name:
  4. Determine the latency to the potential clusters using ping or, preferably, psping (http://technet.microsoft.com/en-gb/sysinternals/jj729731.aspx), as shown below.
     
TCP connect to 85.115.52.150:80:
101 iterations (warmup 1) connecting test: 100%
 
TCP connect statistics for 85.115.52.150:80:
Sent = 100, Received = 100, Lost = 0 (0% loss),
Minimum = 3.48ms, Maximum = 4.20ms, Average = 3.81ms
 
The latency should be as low as possible. If it is greater than 100 ms, the user experience will be poor.
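Where psping is not available, a similar TCP connect measurement can be approximated with a short script. The following is a minimal Python sketch, not a replacement for psping: the function names are hypothetical, the host and port are supplied by the caller, and the 100 ms threshold is the guideline stated above.

```python
import socket
import statistics
import time


def tcp_connect_times(host, port, iterations=10, timeout=2.0):
    """Return TCP connect times in milliseconds; failed attempts are skipped."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        try:
            # create_connection returns once the TCP handshake has completed
            with socket.create_connection((host, port), timeout=timeout):
                samples.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            # Timeouts and refused connections are counted as lost
            continue
    return samples


def summarize(samples, sent, threshold_ms=100.0):
    """Summarize samples in a psping-like form and flag poor latency."""
    if not samples:
        return {"sent": sent, "received": 0, "poor": True}
    avg = statistics.mean(samples)
    return {
        "sent": sent,
        "received": len(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
        "avg_ms": avg,
        # Average above the guideline threshold suggests a poor user experience
        "poor": avg > threshold_ms,
    }
```

For example, calling tcp_connect_times("85.115.52.150", 80, iterations=100) and passing the result to summarize(samples, sent=100) gives figures comparable to the psping output above. Run it once per candidate cluster and compare the averages.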
 
Fix:
 
  1. If the customer is not using the local cluster, check that this is not 'by design'. Some services, such as Google Search, use the egress IP of the proxy to work out the language / location of the requesting client, so changing the cluster may change this.
  2. For GeoIP: Check whether the Cloud GeoIP was set in the past by a custom template in Policy > Custom.
  3. For GeoDNS: GeoDNS uses the IP address of the DNS server to determine the client location, so consider the following:
    • If the site is not using the egress IP of the proxy for language/location and they're using GeoDNS, check if the customer wants to switch to GeoIP
    • Check which cluster GeoIP would use before changing it - if the location is incorrect check with Backline / COPs to get it fixed.
    • Important: For pure Cloud, do not apply the template. Tell the customer where the field is in the Portal, as they must set it themselves.
    • For Hybrid, this is in the Forcepoint Security Manager under Settings > Hybrid Config > User Access > Web Browsing Optimization > Route traffic based on end user's egress IP.
    • It is possible for COPs to 'fix' an ISP DNS server's location if it is incorrect - this would need a COPs escalation with justification.
 
  4. If neither standard GeoIP nor GeoDNS suits the customer's needs, it is possible to map the customer's egress IP to a particular cluster.
    • Important: This may invalidate SLAs and is a serious step.
    • It may be necessary for the customer (or Forcepoint) to host a custom PAC file to force particular behavior. Cases for this should be escalated to Backline for investigation and review.
 
 
Q: Client has some issue with local DNS
If Local DNS is having issues, it may present as latency as well as problems with the PAC file. 
 
Diagnostic:
  1. Web Cloud/Hybrid roaming DNS is used to:
    1. Resolve the PAC server. The time to live (TTL) for PAC lookups is 120 seconds, so unless an issue is causing repeated PAC file downloads, this only affects the first URL after the browser starts.
    2. Resolve the origin server for use with the PAC file. This is one of the more common problems.
  2. Windows caches the DNS reply (even an error reply) for some time, and clearing the cache (ipconfig /flushdns) is quite slow, so using ping or similar methods is not accurate.
  3. Fiddler shows DNS response time in the Statistics section for each request.
  4. Packet captures can also show DNS response time.
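
Because the operating system cache makes ping-style checks unreliable, one alternative is to send a query directly to a chosen DNS server and time the reply. The following is a minimal Python sketch under stated assumptions: it sends a single A-record query over UDP, the function names are hypothetical, and the DNS server address must be supplied by the person running the test. It only confirms that a reply with the matching ID arrived; it does not validate the answer.

```python
import os
import socket
import struct
import time


def build_dns_query(hostname, query_id):
    """Build a minimal DNS query packet for an A record (recursion desired)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN
    return header + qname + struct.pack("!HH", 1, 1)


def time_dns_query(hostname, server, port=53, timeout=2.0):
    """Return the DNS response time in ms, or None on timeout or error."""
    query_id = int.from_bytes(os.urandom(2), "big")
    packet = build_dns_query(hostname, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(packet, (server, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:
            # Covers timeouts and unreachable servers
            return None
        elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Accept the reply only if it echoes the query ID we sent
    if len(reply) < 2 or reply[:2] != packet[:2]:
        return None
    return elapsed_ms
```

Timing the same hostname against the customer's primary DNS server and against its forwarders separately can help show which hop is slow.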
 
Fix:
 
  1. It is up to the customer to give their clients a working DNS infrastructure, which is a complex area. Some characteristics and problems must be taken into account:
    • If the DNS server being queried does not have the requested entry, it uses forwarders to get the answer. A common problem is for the primary DNS (probably a local AD server) to be perfectly OK but to forward to ISP DNS servers that give no or bad responses.
    • If you repeatedly query an entry that has expired (i.e. its TTL has decreased to zero) and you get intermittent failures or delays, then forwarding is the usual cause.
    • A site with a low TTL for testing is www.websense.com (use host -a <hostname> on a Linux box or appliance to see the TTL easily). It is currently 5 seconds, so each request more than 5 s after the first one causes the server to resolve against the US Websense DNS servers, at which point the forwarders are used.
  2. If the DNS server is screened from the internet by a firewall configured to filter fragmented DNS replies, you may intermittently see no reply. This can cause delays or unresolvable hosts. At this time, Checkpoint and Juniper are known to have such a capability and have caused issues.
    • This can delay processing of the PAC file; however, be aware that MS IE caches results from a PAC request, so the client will only see the delay on the initial page load.
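
Where the host command is not available, the TTL can also be read from a raw DNS reply (for example, one captured or received by a test script). The following is a minimal Python sketch, assuming a well-formed reply whose first answer record follows the question section; the function names are hypothetical, and only the first answer is inspected.

```python
import struct


def skip_name(message, offset):
    """Return the offset just past a DNS name starting at the given offset."""
    while True:
        length = message[offset]
        if length & 0xC0 == 0xC0:
            # Compression pointer: two bytes, and the name ends here
            return offset + 2
        if length == 0:
            # Root label terminates the name
            return offset + 1
        offset += 1 + length


def first_answer_ttl(reply):
    """Return the TTL (seconds) of the first answer record, or None if absent."""
    qdcount, ancount = struct.unpack("!HH", reply[4:8])
    if ancount == 0:
        return None
    offset = 12  # DNS header is 12 bytes
    for _ in range(qdcount):
        # Each question is a name followed by QTYPE and QCLASS (4 bytes)
        offset = skip_name(reply, offset) + 4
    offset = skip_name(reply, offset)
    # Answer fixed fields: TYPE, CLASS, TTL, RDLENGTH
    _rtype, _rclass, ttl, _rdlength = struct.unpack("!HHIH", reply[offset:offset + 10])
    return ttl
```

Querying the same server repeatedly and watching the TTL count down to zero shows when the next request will have to go to the forwarders.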
 
 
Q: Is there a network / TCP problem between the client and the cluster proxy server (slow pages, interrupted pages, page does not display)?

A useful Wireshark display filter that shows only Cloud Web traffic by matching the Cloud IP address ranges:
ip.addr==85.115.32.0/19||ip.addr==86.111.216.0/23||ip.addr==116.50.56.0/21||ip.addr==208.87.232.0/21||ip.addr==86.111.220.0/22||ip.addr==103.1.196.0/22||ip.addr==177.39.96.0/22||ip.addr==196.216.238.0/23
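
The same ranges can be used outside Wireshark, for example to check whether an address seen in a log or in psping output belongs to the Cloud service. The following is a minimal Python sketch; the ranges are copied from the filter above and may not be current, and the function name is hypothetical.

```python
import ipaddress

# CIDR ranges copied from the Wireshark display filter above
CLOUD_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "85.115.32.0/19",
        "86.111.216.0/23",
        "116.50.56.0/21",
        "208.87.232.0/21",
        "86.111.220.0/22",
        "103.1.196.0/22",
        "177.39.96.0/22",
        "196.216.238.0/23",
    )
]


def is_cloud_ip(address):
    """Return True if the IPv4 address falls inside one of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in CLOUD_RANGES)
```

For the authoritative list, refer to Cloud service Data Center (cluster) IP addresses and port numbers.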
 
Diagnostic:
  1. Check for partial blocks (i.e. some content on a page is blocked but the base HTML is not). Usually the customer is blocking uncategorized content or file types such as .css or .js. It is the customer's responsibility to correct this once they have been shown how to check using Fiddler2 / Firebug / Chrome debug tools.
  2. Capturing data and analyzing:
    1. If the problem is with a 'top 100' site, HTTPWatch is useful. Fiddler is also useful, but Endpoint would need to be disabled while testing.
    2. Forcepoint TS have an internal tool for analyzing packet captures (Expert System).
    3. HTTPWatch will show 'blocking' where local AV (or an online service) is scanning a URL.
    4. Both Fiddler and HTTPWatch visually depict some problems clearly:
      • DNS lookups for individual requests are shown clearly
      • the timeline shows overlapping requests / responses (the number of parallel requests is determined by the browser); gaps indicate that the client simply did not request the next content, and an extended bar can indicate that local AV is scanning the content
      • check that the problematic requests actually use the proxy (look for the header X-BST-Info)
      • look for 'Forbidden' (page prohibited by policy) and 'Unsuitable material' (content prohibited during scanning)
    5. Parallel captures with Wireshark are helpful in detecting:
      • packet size limiting - many Duplicate ACKs, but small replies seem to get through
      • packet content filtering - many Duplicate ACKs for specific packets, seemingly triggered by particular content
      • packet congestion - some Duplicate ACKs; zero window will be seen, indicating that TCP flow is not working correctly
      • packet shaping - any HTTP No-Op (NOP) packets, but may be similar to any of the above depending on the shaper used (generally the customer will know there is one)
      • upstream request filtering - this is seen in mainland China, Turkey, etc., where requests for particular URLs are prohibited - see "China blocks Internet access for internet users accessing prohibited content"
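
The proxy and block checks from the Fiddler / HTTPWatch step can also be applied to exported response data. The following is a minimal Python sketch, assuming the response headers are available as a name-to-value mapping and the body as text. The function name is hypothetical, only the presence of X-BST-Info is checked (its value format is not covered here), and searching the body for the block strings is an assumption about where they appear.

```python
def classify_response(headers, body=""):
    """Report whether a response went via the proxy and whether it was blocked."""
    # Header names are case-insensitive, so compare in lower case
    names = {name.lower() for name in headers}
    return {
        # X-BST-Info indicates the request actually used the cloud proxy
        "via_proxy": "x-bst-info" in names,
        # 'Forbidden': page prohibited by policy
        "policy_block": "Forbidden" in body,
        # 'Unsuitable material': content prohibited during scanning
        "scan_block": "Unsuitable material" in body,
    }
```

A slow request for which via_proxy is False is not passing through the cluster, so the latency should be investigated on the direct path instead.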