Kerberos SSPI/PAC errors and NetLogon errors 5719 and 5783 and Login Failure Audits – Oh my!

We appear to be having a bunch of Kerberos errors in our SQL clusters that represent 2-30 minutes of downtime at a stretch.

A recent network change seemed to help a lot, but is unfortunately not technically supposed to affect our issues at all. Still we wait and see if the errors happen again.

The outages have been happening randomly for the past 5 days (starting last Thursday, ending yesterday – Monday), probably 4 – 6 incidents a day, for the aforementioned 2 – 30 minutes at a stretch.

Samples of the associated events in the Event Logs:

Bunches of failure audits alternating with Kerberos SSPI Handshake errors in the Application Logs of the SQL Cluster (clustered via Microsoft Clustering):
Kerberos SSPI:

Event Type: Error
Event Source: MSSQLSERVER
Event Category: (4)
Event ID: 17806
Date: 9/29/2008
Time: 8:17:52 AM
User: N/A
Computer:
Description:
SSPI handshake failed with error code 0x80090311 while establishing a connection with integrated security; the connection has been closed. [CLIENT: xxx.xxx.xxx.xxx]

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

The alternate failure audits:

Event Type: Failure Audit
Event Source: MSSQLSERVER
Event Category: (4)
Event ID: 18452
Date: 9/29/2008
Time: 8:17:52 AM
User: N/A
Computer:
Description:
Login failed for user ”. The user is not associated with a trusted SQL Server connection. [CLIENT: xxx.xxx.xxx.xxx]

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

In the Systems log, we commonly see one Netlogon (Event ID 5719) and one Kerberos event (Event ID 7), and one rare Netlogon (Event ID 5783) event that are concurrent with the Application Log events:

Netlogon Event ID 5719:

Event Type: Error
Event Source: NETLOGON
Event Category: None
Event ID: 5719
Date: 9/29/2008
Time: 8:15:59 AM
User: N/A
Computer:
Description:
This computer was not able to set up a secure session with a domain controller in domain due to the following:
There are currently no logon servers available to service the logon request.
This may lead to authentication problems. Make sure that this computer is connected to the network. If the problem persists, please contact your domain administrator.

ADDITIONAL INFO
If this computer is a domain controller for the specified domain, it sets up the secure session to the primary domain controller emulator in the specified domain. Otherwise, this computer sets up the secure session to any domain controller in the specified domain.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0 ^..À

Kerberos Event ID 7:

Event Type: Error
Event Source: Kerberos
Event Category: None
Event ID: 7
Date: 9/29/2008
Time: 8:15:59 AM
User: N/A
Computer:
Description:
The kerberos subsystem encountered a PAC verification failure. This indicates that the PAC from the client in realm had a PAC which failed to verify or was modified. Contact your system administrator.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 5e 00 00 c0 ^..À

The rare NetLogon Event ID 5783:

Event Type: Error
Event Source: NETLOGON
Event Category: None
Event ID: 5783
Date: 9/27/2008
Time: 1:00:56 PM
User: N/A
Computer:
Description:
The session setup to the Windows NT or Windows 2000 Domain Controller \\ for the domain is not responsive. The current RPC call from Netlogon on \\ to \\ has been cancelled.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

We are still watching the situation but suspect that high network latency may be the issue. Other possibilities are a stale DNS cache somewhere and possibly a Veritas Diskeeper 2007 bad install/upgrade. I plan to escalate the general troubleshooting questions to our Microsoft Technical Account Manager today or tomorrow.

Getting MOSS 2007 RTM Backup to work when your SQL Database is one you don’t control

In this post I will be somewhat generic about my machine names and account names. This is partly to keep me in form and not thinking too specifically about my special situation and partly for security reasons. I don’t believe in security through obscurity but I also don’t believe in making it extra easy for an attacker to get the first steps of the puzzle for free.

Configuration of my MOSS 2007 RTM install in development:

  • 1 Server that serves all databases to all development environments (Call it SQLDEV1).
    • I have a specific instance (Call it MOSD) in which I can put all my various MOSS databases and my domain user account (Call it DOMAIN\myuser) is a security admin and dbcreator in the Server.
    • Since I used my domain user account (DOMAIN\myuser) to give all the various configuration services and app pools and datbases an ID during initial setup/configuration of MOSS 2007, I (DOMAIN\myuser) am also dbo on all the MOSS databases in the instance.
    • Additionally, it should be noted that on SQLDEV1, the MSSQLSERVER service is running as one domain user account (Call it DOMAIN\dbuser1), and the instance (MOSD) is running as another (Call it DOMAIN\dbuser2).
  • 1 Server that has all the MOSS 2007/.NET 2.0/.NET 3.0 WFX installs (Call it MOSSDEV1).
    • On this machine, my user account (DOMAIN\myuser) is a local administrator. I also have some additional rights that allow me to log into the machine remotely with a terminal services client.
    • It was a precondition to my getting these rights that I set up the MOSS 2007 install so that all configuration aspects that required local admin access be set up with my domain account’s ID (DOMAIN\myuser). I know this isn’t ideal in some respects, but it seems to work okay in this development environment. This will be the main factor in making this blog entry not be entirely helpful in debugging YOUR configuration problem, but I hope it’ll help anyway.

With no additional special steps, I couldn’t get MOSS 2007’s Operations interface to successfully back up the Farm (via the Central Administration: http://your_server:your_port/_admin/Backup.aspx). I kept getting some progress, but each and every actual site collection failed with a failure message ending in “Operating system error 5(Access is denied.). BACKUP DATABASE is terminating abnormally.” [It should be noted that I was previously getting an error 3, which appeared to be tied to using a non-UNC path in the Backup Location field.]

The steps necessary for me to get it all working were:

  1. Create a shared folder on the system for where you want the backups to go. (i.e. \\MOSSDEV1\Farm Backups\).
  2. Add full control permissions to this share for all three accounts: The service account for MSSQLSERVER on the database server (DOMAIN\dbuser1), the service account for the database instance hosting the content/configuration databases (DOMAIN\dbuser2) and the service account running the MOSS 2007 application pool (my guess of the most likely suspect in this operation – DOMAIN\myuser).
  3. Use the share name you created in the Backup Location text box in the 2nd step of the Backup configuration in Central Administration (i.e. \\MOSSDEV1\Farm Backups\)

I had been using the Admin’s traditional UNC path (i.e. \\MOSSDEV1\d$\Farm Backups\) without explicitly sharing a folder, but that tripped me up, because the SQL Server service account IDs were not Local Admins on the MOSS 2007 box (i.e. MOSSDEV1).
Also it took a bit of digging and asking my DBAs about the service account identities for the SQL Server.

My main reference is about SQL Server Backup in the Microsoft KB #255235.