Paul's Internet Landfill/ 2016/ Tools for Debugging Samba

Tools for Debugging Samba

Recently I spent several unpleasant days trying to troubleshoot an authentication problem with some Samba shares. I will describe the problem later, but more important than the particular issue is the set of tools I picked up to troubleshoot Samba issues. Apparently this suite of tools is obvious to everybody else on the Internet but me, so I am writing them down for my future self.

The bad news is that Samba is complicated, and tracing what the software suite is doing is tricky. The good news is that Samba has lots of debugging infrastructure and is open source. I have no intention of compiling Samba from source ever, but the debugging logs print function names, so following along the program flow by looking through source files was possible (and not even that unpleasant, since the source code is fairly readable). I was debugging on Ubuntu, so I grabbed the .orig.tar.xz file from http://packages.ubuntu.com/trusty/samba and followed along.

As indicated in the above link, I was using Samba 4.1 when doing this troubleshooting.

Log files and smbcontrol

Samba generates logs of log files. Thanks to the following line in smb.conf:

log file = /var/log/samba/log.%m

there is a separate log file generated by each host that connects to the share. There are also log.wb-<DOMAIN> and log.winbindd files that are useful. If you are able to troubleshoot when the shares are being used less, then going to the /var/log/samba directory and typing ls -ltc will show you the latest log files that were accessed. These files might have clues about the problematic behaviours.

You can also increase the verbosity of log files on the fly, which is really handy. Log verbosity ranges from 0 to 10 (I guess the Samba developers never watched This is Spinal Tap.) Verbosity 10 prints out low-level data structures that are being sent and received, so it was helpful for me.

To increase the verbosity of winbind logging to 10 on the fly, type:

smbcontrol winbind debug 10

You can change the verbosity of other components as well.

You want to decrease this back to 0 or 1 after troubleshooting, because there is a lot of logging that probably slows Samba down.

There is also a pdbedit command which apparently is useful for manipulating SAM password files, but I did not need it for this task.

Name Lookups and Winbind tools

My problem had to do with winbind finding incorrect names from our Active Directory LDAP database. The easiest way to look up a name was to type:

getent passwd

or

getent passwd <username>

The wbinfo command also is useful. wbinfo -u seems to do much the same as getent passwd . wbinfo --lookup-sids was useful to translate Windows security identifiers into user names.

Restarting Samba and Clearing Cache

In several situations I found myself wanting to start and stop Samba. On Ubuntu, this meant typing:

service smbd stop
service nmbd stop
service winbind stop 

and the corresponding "start" commands to restart the services.

To flush the caches, you are supposed to be able to type:

net cache flush 

but this does not actually flush every cache.

You are supposed to be able to run winbind with a -n switch to bypass caching, but this does not ignore all caches.

TDB caches and tdbtool

Samba caches information in a number of "trivial database" (.tdb) files. On my Ubuntu system, these files were located in /var/lib/samba and /var/cache/samba . Supposedly all of these files (or maybe the ones in these folders?) are regenerated if they are deleted. The Samba documentation page suggests that some of these files should not be deleted.

To dump the contents of a database use the tdbtool command. For example:

tdbtool <filename> dump 

If you redirect this to a file you get something you can grep.

Domain Member Operations

It is easy to discover how to join a domain from Samba:

net ads join -U <admin-user-name>

It is less easy to discover how to unjoin a domain:

net ads leave -U <admin-user-name>

Windows LDAP Logging

This is not really Samba related but is really useful. You can make your domain controllers report the exact LDAP queries from clients. The trick is documented here: http://serverfault.com/questions/382665/are-there-alternatives-to-sysinternals-adinsight . Set the following registry values on your domain controller:

Path: HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics\15 Field Engineering
Type: DWORD
Value: 5

Path: HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters\Expensive Search Results Threshold
Type: DWORD
Value: 1

I am not sure why the second registry entry should be required. The documentation suggests it is for filtering certain types of LDAP queries. But I was not able to get things working without it. The following registry entry may also be relevant:

Path: HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics\16 LDAP Interface Events
Type: DWORD
Value: 2

but I do not know for sure. (It appeared to be set on my domain controller and I don't know why.)

In addition to these registry edits I also had to increase the size of the "Directory Service" event log. Go to Event Viewer, right-click the event log in question, choose "Properties" and then increase the "Maximum Log Size".

There is official Microsoft documentation here: https://support.microsoft.com/en-us/kb/314980

Process Tracking

At one point I needed to figure out what domain controller winbind was connecting to:

lsof -i -n | grep winbind 

strace is really verbose but sometimes useful in figuring out what filenames are being touched. There are a bunch of invocations you could use. I first restarted winbind to get a process number

# service winbind restart
winbind stop/waiting
winbind start/running, process 5010

and then I called strace to attach to this process:

strace -o /tmp/winbind.strace.01 -f -ttT -p 5010 

-p indicates the process to attach to. -f attaches to the process AND its children. -ttT does some time output I don't understand but is probably useful.

tcpdump can be helpful in two ways: you can see whether your process is talking to the hosts you think it is and you can generate PCAP files that you can then use in Wireshark. To look at communications between my machine 10.10.10.25 and the two domain controllers 10.10.10.2 and 10.10.10.3 I used the following invocation:

tcpdump -i eth0 -n 'host 10.10.10.25 and (host 10.10.10.2 or host 10.10.10.3)'

Note the single quotation marks. They are important because of shell parsing.

If I want to generate a PCAP file I use the -w switch. In this case we can also look for port traffic from my machine on port 389:

tcpdump -i eth0 -n -w /tmp/winbind.tcpdump.01.pcap 'host 10.10.10.25 and port 389'

Sidebar: The troubleshooting task

We have a Samba server which has some WPKG scripts. On shutdown, our staff machines are supposed to run these scripts. When logged in, we could run the scripts manually, but the shutdown scripts were not working.

In looking at the Samba smbd logs for the machine, it looked like the computer was trying to authenticate as the wrong machine. For example, the log for machine new-machine04 would have the following entries:

[2016/03/01 02:53:04.537446,  0] ../source3/auth/pampass.c:797(smb_pam_accountcheck)
  smb_pam_accountcheck: PAM: Account Validation Failed - Rejecting User old-machine01$!
[2016/03/01 02:53:04.537730,  1] ../source3/auth/user_krb5.c:142(get_user_from_kerberos_info)
  PAM account restrictions prevent user [TWC\old-machine01$] login
[2016/03/01 02:53:04.537759,  1] ../source3/auth/auth_generic.c:97(auth3_generate_session_info_pac)
  Failed to map kerberos principal to system user (NT_STATUS_WRONG_PASSWORD)
[2016/03/01 02:53:04.637576,  0] ../source3/auth/pampass.c:797(smb_pam_accountcheck)
  smb_pam_accountcheck: PAM: Account Validation Failed - Rejecting User old-machine01$!
[2016/03/01 02:53:04.637860,  1] ../source3/auth/user_krb5.c:142(get_user_from_kerberos_info)
  PAM account restrictions prevent user [TWC\old-machine01$] login
[2016/03/01 02:53:04.637889,  1] ../source3/auth/auth_generic.c:97(auth3_generate_session_info_pac)
  Failed to map kerberos principal to system user (NT_STATUS_WRONG_PASSWORD)

There was something going wrong with machine accounts. Removing the computer from the domain, deleting the computer object, and then re-joining the computer fixed the problem, but I did not know where the old-machine01 entries were coming from. They were legitimate machine names we had used in the past, but they were mapped incorrectly. I was freaked out because I thought that our Active Directory might be inconsistent, which would have been a huge problem.

Fortunately, the problem ended up being Samba's fault. The smb_pam_accountcheck routine uses information returned from winbind, and winbind was referencing a cache file /var/cache/samba/netsamlogon_cache.tdb that is (a) stored in a different place than other cache files and (b) is never flushed or ignored, even if you specify the -n parameter to the winbind daemon. This file contained the mysterious old machine names.

I believe the problem is as follows: sometimes instead of reimaging a computer, we would rename an existing computer and give it to somebody new. This meant that the computer SID would not change. Active Directory stored the new information (and a new machine account password) and the netsamlogon_cache.tdb file kept the old machine account name with the same SID. This lookup would always fail, so Samba would reject the connection request, so the shutdown script would not work. Deleting or renaming the netsamlogon_cache.tdb file fixes the problem. Unfortunately I am so inefficient at troubleshooting Samba that figuring this out took me days.