NSX 6.3.6 to 6.4.5 – Controller problem encountered

NSX upgrades can be a delicate thing, even when everything is in its finest shape.

After we had successfully upgraded the NSX Managers, we proceeded with the upgrade of the NSX Controllers. As a pre-check we issued the command “show control-cluster status” and it looked fine. The upgrade to 6.4.6 went well, and we could vMotion VMs around after the controllers had booted. But the post-checks were not OK: “show control-cluster status” did not return what we expected, and we were not confident enough to proceed with the host upgrades.

After some troubleshooting we found that the /var/log partition was full on two of the three controllers. Without any other evidence, we concluded that this was the problem. Some google-fu did not turn up any KB or blog post on how to purge the logs.

But we found out that we could get into an engineering mode that would give us shell access. Long story short, we did the following:

1. Followed https://kb.vmware.com/s/article/2149630 to gain shell access on the manager.
1.1 The password is IAmOnThePhoneWithTechSupport
2. Extracted the root passwords for the controllers with /home/secureall/secureall/sem/WEB-INF/classes/GetNvpApiPassword.sh controller-nn
3. Logged into each controller and issued: debug os-shell, thereby gaining root shell access.
4. Deleted /var/log/syslog.1 on each node.
5. Did a rolling restart of the controllers; after they booted, they all joined the cluster.

After this we got the status we wanted. In the meantime we had created a case with VMware support, and the support engineer was on a remote session with us. We told him what we had done, and we verified that the controllers were healthy, which they were.

Next step, VIB upgrade on the hosts.

Good commands to know:

show process monitor
show controller list all
show control-cluster status

Edit: This article from VMware describes the exact problem we encountered. We also contacted VMware Support, but before they were able to assist us we had the problem solved. 🙂
https://kb.vmware.com/s/article/59509

Process of getting the root password for controllers.

Tomcat9 and java12 on FreeBSD

Quick post, had to make Tomcat9 and Java12 work together. The procedure is as follows:

1. pkg install tomcat9 (it will also install java8)
2. pkg install openjdk12

Now edit /etc/rc.conf and add one parameter to start Tomcat on boot and one to set the Tomcat java_home.

tomcat9_enable="YES"
tomcat9_java_home="/usr/local/openjdk12"

And now to the part that took me a long time to figure out. Java 12 no longer has a feature that Tomcat uses in its startup parameters, but if you remove that parameter from the init script you are able to start it up. The line is: -Djava.endorsed.dirs='/usr/local/apache-tomcat-8.0/endorsed' \

command="/usr/local/bin/jsvc"
command_args="-java-home '${_tomcat_java_home}' \
        -server \
        -user ${_tomcat_catalina_user} \
        -umask ${_tomcat_umask} \
        -pidfile '${pidfile}' \
        -wait ${_tomcat_wait} \
        -outfile '${_tomcat_stdout}' \
        -errfile '${_tomcat_stderr}' \
        -classpath '${_tomcat_catalina_home}/bin/bootstrap.jar:/usr/local/share/java/classes/commons-daemon.jar:${_tomcat_catalina_home}/bin/tomcat-juli.jar${_tomcat_classpath}' \
        -Djava.util.logging.manager=${_tomcat_logging_manager} \
        -Djava.util.logging.config.file='${_tomcat_logging_config}' \
        ${_tomcat_java_opts} \
        -Djava.endorsed.dirs='/usr/local/apache-tomcat-8.0/endorsed' \<<<<<<<<<<<< Remove this line!!!
        -Djava.endorsed.dirs='${_tomcat_catalina_home}/endorsed' \
        -Dcatalina.home='${_tomcat_catalina_home}' \
        -Dcatalina.base='${_tomcat_catalina_base}' \
        -Djava.io.tmpdir='${_tomcat_catalina_tmpdir}' \
        org.apache.catalina.startup.Bootstrap \
        ${_tomcat_pipe_cmd}"

run_rc_command "$1"

After this, you are able to boot tomcat9 with java12 🙂
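
If the same edit has to be repeated on several machines, it can be scripted. Here is a minimal sketch in Python that drops the stale line from the text of the init script. The function name is my own, and the sketch only works on a string, so reading and writing the actual rc script (and taking a backup first) is left to you.

```python
# The exact option from the post; only lines containing it are removed.
STALE_OPTION = "-Djava.endorsed.dirs='/usr/local/apache-tomcat-8.0/endorsed'"


def drop_stale_endorsed_dirs(script_text: str) -> str:
    """Return the init script text without the line carrying the stale option."""
    kept = [
        line
        for line in script_text.splitlines(keepends=True)
        if STALE_OPTION not in line
    ]
    return "".join(kept)
```

The second java.endorsed.dirs line, the one that points at ${_tomcat_catalina_home}, is left alone, which matches what was done by hand above.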

Getting all domains from Office365 tenants

Mail spoofing etc. is a big problem. There are technologies that can help, but many domain owners have not yet implemented them. To help our customers, we have started to monitor whether the SPF, DKIM and DMARC policies have been implemented, and if not, we can help 🙂

Our own spam filter solution has a button that gives you an export of all the domains, nice and easy, but the Office 365 CSP portal doesn't.

So here is a quick script to help with that. The next post will hopefully contain the check script that tests whether a domain has implemented SPF, DKIM or DMARC.

# Get the tenant IDs of all customers under the partner (CSP) account
$tenantIds = Get-MsolPartnerContract -All | Select-Object TenantId

foreach ($tenantId in $tenantIds)
{
    $domains = Get-MsolDomain -TenantId $tenantId.TenantId
    $customer = Get-MsolCompanyInformation -TenantId $tenantId.TenantId

    foreach ($domain in $domains)
    {
        # Skip domains containing 'microsoft', such as the default onmicrosoft.com ones
        if ($domain.Name -notmatch 'microsoft')
        {
            [pscustomobject]@{Domain=$domain.Name;Customer=$customer.DisplayName} |
                Export-Csv -Path C:\temp\domainsInO365.csv -Append -NoTypeInformation
        }
    }
}

NSX Edge PowerShell manipulation

This is from a VMware support experience. A customer could not change the DNS server parameters of the NSX Edge IP pool. It turned out to be caused by a bug in VCD 9.5: the Edge XML config was missing some tags, so the XML could not be validated when VCD posted the edited config back to NSX Manager.

I have attached VMware support's answer at the bottom of the post.

The script first gets all edges from NSX Manager; you then find the correct one and fill its ID into the next part of the script. That part downloads the XML to a file on your local machine. You edit the file, put in the missing tags, and lastly PUT the XML back to NSX Manager. After this operation, it works from the GUI again.

# Import credential module and login information
$ReturnObj = import-credentials vmwareSSO
$nsxUsername = $ReturnObj.Username
$nsxPassword = $ReturnObj.Password

# Other variables
$tempFile = "C:\temp\edge-747_jvr.xml"

# Allow all SSL protocols
$AllProtocols = [System.Net.SecurityProtocolType]'Ssl3,Tls,Tls11,Tls12' 
[System.Net.ServicePointManager]::SecurityProtocol = $AllProtocols

Add-Type @"
    using System;
    using System.Net;
    using System.Net.Security;
    using System.Security.Cryptography.X509Certificates;
    public class ServerCertificateValidationCallback
    {
        public static void Ignore()
        {
            ServicePointManager.ServerCertificateValidationCallback += 
                delegate
                (
                    Object obj, 
                    X509Certificate certificate, 
                    X509Chain chain, 
                    SslPolicyErrors errors
                )
                {
                    return true;
                };
        }
    }
"@


[ServerCertificateValidationCallback]::Ignore();

# Getting all edges
$Type = "Accept: application/xml"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword))}
$nsxUri = "https://10.1.10.4/api/4.0/edges"

[xml]$edges = (Invoke-WebRequest -Uri $nsxUri -Headers $Header -Method GET -ContentType $Type).Content
foreach ($edge in $edges.pagedEdgeList.edgePage.edgeSummary)
{
    $edgeInfo = "name: {0} - ID: {1}" -f $edge.name, $edge.objectId
    $edgeInfo
}

# Getting specific edge config
$edgeId = "edge-747"
$Type = "Accept: application/xml"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword))}
$nsxUri = "https://10.1.10.4/api/4.0/edges/$edgeId"

(Invoke-WebRequest -Uri $nsxUri -Headers $Header -Method GET -ContentType $Type).Content | out-file $tempFile

# PUT edge config after edit
$Type = 'application/xml'
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword))}
$nsxUri = "https://10.1.10.4/api/4.0/edges/$edgeId"
# -Raw reads the file as one string instead of an array of lines
$edgeConfigAltered = Get-Content $tempFile -Raw

$respons = Invoke-WebRequest -Uri $nsxUri -Headers $Header -Method Put -ContentType 'application/xml' -Body $edgeConfigAltered
# Statuscode 204 is accepted
$respons.StatusCode
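
Before hand-editing the downloaded XML, it can help to list which NAT rules lack a given element. Below is a rough sketch in Python, assuming the edge config uses no XML namespaces and that you pass in the element names yourself. The function name is mine, and the sketch does not know which elements NSX actually requires; that list has to come from the KB article support refers to.

```python
import xml.etree.ElementTree as ET


def rules_missing_tags(edge_xml: str, required_tags: list[str]) -> dict[str, list[str]]:
    """Map each natRule's ruleId to the required child elements it lacks."""
    root = ET.fromstring(edge_xml)
    report = {}
    for rule in root.iter("natRule"):
        present = {child.tag for child in rule}
        missing = [tag for tag in required_tags if tag not in present]
        if missing:
            # Rules without a ruleId are reported under a placeholder key.
            rule_id = rule.findtext("ruleId", default="<no ruleId>")
            report[rule_id] = missing
    return report
```

Calling rules_missing_tags(xml_text, ["vnic"]) on a config where only rule 196726 has no vnic element would return {"196726": ["vnic"]}; an empty dict means every rule carries all the elements you asked about.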

From Support:
– The issue you are seeing is a known issue 9.5.
– Like I mentioned in the previous email, this is due to missing elements from the xml.
– From the xml in the logs, I could see there are 52 NAT rules on that edge.Correct me if I am wrong. The following 2 rules had the elements missing

<natRule>
    <ruleId>196726</ruleId>
    <ruleType>user</ruleType>
    <action>dnat</action>
    <vnic>0</vnic>
    <originalAddress>IP</originalAddress>
    <translatedAddress>IP</translatedAddress>
    <dnatMatchSourceAddress>any</dnatMatchSourceAddress>
    <loggingEnabled>false</loggingEnabled>
    <enabled>true</enabled>
    <description>RULE</description>
    <protocol>tcp</protocol>
    <originalPort>3417</originalPort>
    <translatedPort>3478</translatedPort>
    <dnatMatchSourcePort>any</dnatMatchSourcePort>
</natRule>
<natRule>
    <ruleId>196727</ruleId>
    <ruleType>user</ruleType>
    <action>dnat</action>
    <vnic>0</vnic>
    <originalAddress>IP</originalAddress>
    <translatedAddress>IP</translatedAddress>
    <dnatMatchSourceAddress>any</dnatMatchSourceAddress>
    <loggingEnabled>false</loggingEnabled>
    <enabled>true</enabled>
    <description>RULE</description>
    <protocol>tcp</protocol>
    <originalPort>3416</originalPort>
    <translatedPort>3234</translatedPort>
    <dnatMatchSourcePort>any</dnatMatchSourcePort>
</natRule>

I have attached the file with the list of all the NAT rules seen from the logs if you need to cross-verify.

Plan:
– To fix the issue,please follow https://kb.vmware.com/s/article/67193

If you have any further questions,let me know.

Have a good evening,

Best regards,

Deepthy

Make a clone of VMs to NAS – The PowerCLI way

Quick post. I have a customer that wants a yearly clone of their VMs, copied to a NAS and then shipped to the customer's HQ. The owner of the company made this a requirement. Fair enough. I have almost always cloned the VMs through the GUI. In the beginning this was easy because they only had 5 servers, but they now have more, so this time I wanted to try to script it instead. It took me some extra time, but in the end I think it’s worth it. My PowerShell skills are not great, I am still learning, so bear with me.

# Variables
$vcenter = "<IP or hostname>"
$cluster = "<name of cluster that contains the servers>"
$nfsIP = "1.2.3.4"
$nfsMount = "/nfs"

# Getting VMware PowerShell Modules
Get-Module -Name vmware* -ListAvailable | Import-Module

# Connect to vcenter
Connect-VIServer -Server $vcenter -User <username>

# Mount NAS 
get-cluster $cluster | get-vmhost | new-datastore -nfs -name NAS -path $nfsMount -nfshost $nfsIP

# More Variables
$ds = get-datastore NAS
$tempHost = "esx74.domain.tld"
$vms = Get-VM -Name customer*

# Copy all 
foreach ($vm in $vms)
{
    new-vm -name "$($vm.name)-clone" -VM $vm -Datastore $ds -vmhost $tempHost
    Remove-VM -VM "$($vm.name)-clone" -DeleteFromDisk:$false -Confirm:$false -RunAsync
}

# Remove datastore from hosts again
Get-Cluster $cluster | Get-VMHost | Remove-Datastore -Datastore $ds

FreeNAS installation failed – Operation not permitted

Ran into an annoying problem. I have had it multiple times in the past and never remember what the fix is, so now it’s on the blog for the next time I need it. Use the shell option in the FreeNAS installer; the disks that I wanted to install onto were ada0 and ada1.

1. sysctl kern.geom.debugflags=0x10
2. dd if=/dev/zero of=/dev/ada0 bs=512 count=1 && dd if=/dev/zero of=/dev/ada1 bs=512 count=1

This will wipe the first sector of each disk, which holds the partition table, and afterward you can install FreeNAS without problems.

ESXi – physical memory population

Had an interesting problem where an ESXi host only showed 30GB of memory, but the motherboard was populated with 6*8GB modules. In earlier versions of ESXi (5.5 and older) it was possible to use dmidecode to show how the physical hardware was populated, but from 6.0 onward it has been removed.

The command to find that kind of information is now “smbiosDump”.

smbiosDump | grep -A 6 'Memory Device'

You can also just run smbiosDump without any parameters and you get a whole lot of information to crawl through.

Timedrift – domain controller and clients

The customer's domain controllers were both virtual, and due to CPU congestion it seems that the time had been drifting.

So the DCs were 5 minutes behind, and so were the clients. The fix is as follows.

1. Find the DC that holds the PDC role.

netdom query fsmo

2. Issue the following command on the PDC to sync the time with some of the pool.ntp.org servers.

w32tm /config /syncfromflags:manual /manualpeerlist:"0.pool.ntp.org 1.pool.ntp.org 2.pool.ntp.org 3.pool.ntp.org" /reliable:yes /update

3. After the time on the PDC is correct again, issue the following on the other domain controllers, the ones that are not the PDC.

w32tm /config /syncfromflags:domhier /update

4. Let the clients resync their time: either wait for it to happen or issue the following.

w32tm /resync

Freebsd – Going from stable to release

An outdated FreeBSD 10.1-STABLE server needed to be updated before it could install packages again. The problem was that it had been deployed from stable. I normally never use stable because it is not production ready; it is a development branch. But this server was on stable, and here are the steps to get it onto a release train.

1. Update the source tree on the server.

cd /usr/src && rm -rf * && svnlite switch https://svn.freebsd.org/base/releng/10.4 /usr/src/

2. Follow the link below. The only change I made was to add “-j 6” when running make, so that it used multiple cores.
https://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/makeworld.html

After the merge-ui command you just choose “later” for all the prompts about merging config files.

And voilà, the server is updated to FreeBSD 10.4-RELEASE. Then you can update with the binary process, freebsd-update upgrade -r 11.2-RELEASE.

In case you need to know more about stable vs. release, here is a link I found on the FreeBSD forum.
http://srobb.net/release.html

P2V of a FreeBSD

Was looking for someone who had done a P2V of FreeBSD, but my Google skills were not good enough. So here it goes.

1. Make a VM with a disk a bit larger than the one in the source physical machine.
2. Boot the VM from a FreeBSD live CD.
3. Give it an IP address.

ifconfig vmx0 inet 10.0.2.59 netmask 255.192.0.0

4. Make nc listen for input on a port (6666 here) and pipe it to dd.

nc -l 6666 | dd bs=16M of=/dev/da0

5. On the physical machine, run dd and pipe it to nc.

dd bs=16M if=/dev/ad0 | nc 10.0.2.59 6666

6. wait for it to finish….
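
For anyone curious what the dd and nc pair is doing, the same data flow can be sketched in Python: one side reads the source in blocks and pushes them down a TCP connection, the other side writes whatever arrives to the target. All function names here are my own. The sketch is meant for ordinary files such as disk images; raw disk devices have alignment requirements it does not handle, so it is an illustration and not a replacement for the commands above.

```python
import socket

BLOCK_SIZE = 16 * 1024 * 1024  # 16 MiB, mirroring bs=16M in the dd commands


def receive_stream(conn: socket.socket, target_path: str) -> int:
    """Write everything arriving on conn to target_path; return bytes written."""
    total = 0
    with open(target_path, "wb") as target:
        # recv() returns b"" once the sender has closed the connection.
        while chunk := conn.recv(BLOCK_SIZE):
            target.write(chunk)
            total += len(chunk)
    return total


def send_stream(source_path: str, conn: socket.socket) -> int:
    """Read source_path block by block and send it over conn; return bytes sent."""
    total = 0
    with open(source_path, "rb") as source:
        while chunk := source.read(BLOCK_SIZE):
            conn.sendall(chunk)
            total += len(chunk)
    return total


def listen_and_receive(port: int, target_path: str) -> int:
    """Receiver side (the new VM): roughly `nc -l PORT | dd of=TARGET`."""
    with socket.create_server(("", port)) as server:
        conn, _peer = server.accept()
        with conn:
            return receive_stream(conn, target_path)


def connect_and_send(source_path: str, host: str, port: int) -> int:
    """Sender side (the physical machine): roughly `dd if=SOURCE | nc HOST PORT`."""
    with socket.create_connection((host, port)) as conn:
        return send_stream(source_path, conn)
```

As with nc, the receiver has to be started first, and the transfer ends when the sender closes the connection.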

Ceph – slow recovery speed

Onsite at a customer, a 36-bay OSD node was down in their 500TB cluster built with 4TB HDDs. When it came back online, the Ceph cluster started to recover and rebalance.

The problem was, it was dead slow. 78Mb/s is not much when you have a 500TB cluster. So what to do?

There are several settings within Ceph you can adjust. Here are the two settings that worked for me.

osd max backfills:
Description: The maximum number of backfills allowed to or from a single OSD.
Default value: 1

I set it to 8, and the recovery went to 350Mb/s. Set it to 16 and recovery was 700Mb/s, but clients were also affected. So 8 was a more moderate setting.

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'

osd recovery max active

Description: The number of active recovery requests per OSD at one time. More requests will accelerate recovery, but the requests place an increased load on the cluster.
Default value: 3

Set it up a notch to 4.

ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
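
To get a feel for why the recovery rate matters, here is a back-of-the-envelope calculation in Python. It rests on assumptions that are mine and not measured: the rates are read as megabytes per second, units are decimal, and the node is treated as 36 completely full 4TB disks, which is the worst case.

```python
def recovery_hours(data_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move data_tb terabytes at rate_mb_per_s megabytes per second."""
    megabytes = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return megabytes / rate_mb_per_s / 3600


if __name__ == "__main__":
    node_tb = 36 * 4  # 36 bays x 4 TB, assuming every disk is full
    for rate in (78, 350, 700):
        print(f"{rate:>4} MB/s: {recovery_hours(node_tb, rate):6.1f} hours")
```

Under those assumptions, a full node takes roughly 513 hours at 78 MB/s and roughly 114 hours at 350 MB/s. If the figures were really megabits per second, every number is eight times worse.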
Mysql replication

Had to resync a master-slave replication setup. Here are my notes on how it’s done.

Binary log files are kept for 7 days on DB1. If the replication has been stopped for more than a week, DB2 cannot start replicating again because the binary log files it needs are no longer available. Therefore a fresh dump is needed, and DB2 replication can then be started again from the master log position recorded in the dump.

Procedure:

Parameters:

  • single-transaction makes it possible to do the dump without locking the database, which is very useful when you have to dump from a production database. But because the DB is not locked, you must not create or alter table schemas while the dump runs (see the MySQL documentation).
  • master-data is very useful because it records the master position at the time of the dump and puts it into the dump file. That makes it much easier to start the slave from the correct position. The value 2 means it is only written to the output as a comment (see the MySQL documentation).
  • events and routines: if there are any stored procedures or the like on the old server, we take them with us (see the MySQL documentation).

mysqldump --single-transaction --quick --master-data=2 --events --routines <DATABASE> | gzip > /data/<DATABASE>_`date +%F`.sql.gz

When the dump is done, we move the dump file over to the other server and import it into the MySQL server there. If there already is an old database in place, drop it and create it again first.

zcat <DATABASE>.sql.gz | mysql <DATABASE>

Also, have a look at the head of the dump file where we will find the master position data that we need to start the replication again.

gzip -cd <DATABASE>.sql.gz | head -n24

Now we have the position and need the user for replication. I did it on an older 5.5 database; in newer MySQL servers it is done differently.

GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'happyS3ed99';

Or if the user is in place and you just need to reset the password:

SET PASSWORD FOR 'repl'@'192.168.10.11' = PASSWORD('happyS3ed99'); FLUSH PRIVILEGES;

When the dump is imported, we need to set up the master to slave replication again. Remember to have a user on DB1 that allows replication from the DB2 server, and have the user and password ready.

CHANGE MASTER TO MASTER_HOST='<IP>', MASTER_USER='repl', MASTER_PASSWORD='happyS3ed99', MASTER_LOG_FILE='mysql-bin.000849', MASTER_LOG_POS=758329777;

After a START SLAVE it will begin to replicate from the master. Now you can do a “mysql -e 'show slave status\G'” and see if the slave IO is running as it should.
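
Instead of eyeballing the head of the dump, the master position can also be pulled out with a few lines of Python. This is a small sketch, assuming the dump was made with --master-data=2 so that the header contains a commented CHANGE MASTER TO line; the function name and the limit of 100 header lines are my own choices.

```python
import gzip
import re

# Matches the commented line that mysqldump --master-data=2 writes near the top.
CHANGE_MASTER_RE = re.compile(
    r"CHANGE MASTER TO MASTER_LOG_FILE='(?P<file>[^']+)',\s*MASTER_LOG_POS=(?P<pos>\d+)"
)


def find_master_position(dump_path: str, max_lines: int = 100):
    """Return (log_file, log_pos) from the header of a gzipped dump, or None."""
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as dump:
        # Only the first max_lines lines are read, so large dumps stay cheap.
        for _, line in zip(range(max_lines), dump):
            match = CHANGE_MASTER_RE.search(line)
            if match:
                return match["file"], int(match["pos"])
    return None
```

The two returned values are what goes into MASTER_LOG_FILE and MASTER_LOG_POS in the CHANGE MASTER TO statement. A result of None means the line was not found in the header, for instance because the dump was taken without --master-data.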

Rescuing a Juniper SRX550


Nothing is greater than getting a call from HQ 30 minutes after closing hours. Nevertheless, I decided to take the call. Network problem onsite at a customer… After getting the green light from the woman in charge, I got in the car and went on to the customer.

Connecting the USB cable to the SRX console port, I got a weird boot sequence, just like the following:

>U-Boot 1.1.6-JNPR-2.7 (Build time: Nov 26 2013 - 19:04:49)

Initializing memory this may take some time...
Measured DDR clock 533.33 MHz
SRX_550 board revision major:1, minor:18, serial #:
OCTEON CN6335-AAP pass 2.2, Core clock: 1300 MHz, DDR clock: 533 MHz (1066 Mhz data rate)
DRAM:  2048 MB
Starting Memory POST...
Checking datalines... OK
Checking address lines... OK
Checking 512K memory for U-Boot... OK.
Running U-Boot CRC Test... OK.
Flash:  8 MB
USB:   scanning bus for devices... 1 USB Device(s) found
       scanning bus for storage devices... 0 Storage Device(s) found
Clearing DRAM...... done
BIST check passed.
PCIe: Initializing port 1
PCIe: Port 1 link active, 1 lanes, speed gen1
Boot Media: usb internal-compact-flash
Net:   octeth0

  ide 0: Model: CF CARD  Firm: Ver7.01K Ser#:
            Type: Removable Hard Disk
            Capacity: 3811.9 MB = 3.7 GB (7806960 x 512)

Warning!!!  SSD not detected
POST Passed
Press SPACE to abort autoboot in 1 seconds
ELF file is 32 bit
Loading .text @ 0x8f0000a0 (246560 bytes)
Loading .rodata @ 0x8f03c3c0 (14144 bytes)
Loading .reginfo @ 0x8f03fb00 (24 bytes)
Loading .rodata.str1.4 @ 0x8f03fb18 (16516 bytes)
Loading set_Xcommand_set @ 0x8f043b9c (96 bytes)
Loading .rodata.cst4 @ 0x8f043bfc (20 bytes)
Loading .data @ 0x8f044000 (5744 bytes)
Loading .data.rel.ro @ 0x8f045670 (120 bytes)
Loading .data.rel @ 0x8f0456e8 (136 bytes)
Clearing .bss @ 0x8f045770 (11600 bytes)
## Starting application at 0x8f0000a0 ...
Consoles: U-Boot console
Found compatible API, ver. 2.7

FreeBSD/MIPS U-Boot bootstrap loader, Revision 2.7
(ccheng@svl-junos-d081.juniper.net, Tue Nov 26 19:05:43 PST 2013)
Memory: 2048MB
[1]Booting from internal-compact-flash slice 1
Un-Protected 1 sectors
writing to flash...
Protected 1 sectors

can't load '/kernel'
can't load '/kernel.old'
Press Enter to stop auto bootsequencing and to enter loader prompt.


U-Boot 1.1.6-JNPR-2.7 (Build time: Nov 26 2013 - 19:04:49)

Initializing memory this may take some time...

Either the Junos partition was corrupt or the disk inside the unit was fried. I decided to try to install Junos again, just to see if that would help. Went to juniper.net and downloaded the oldest Junos version available, junos-srxsme-12.3X48-D10.3-domestic.tgz. Found a USB drive, put the .tgz file on it and plugged it into the SRX. From the console I interrupted the bootloader while it was trying to find the kernel and issued the following command.

install file:///junos-srxsme-12.3X48-D10.3-domestic.tgz

It began to install Junos, but when it tried to create partitions on the card, it died with DMA errors. Great!

Since an SRX550 is not something you find every day and spare parts are hard to get (support had also expired), I decided to take the SRX apart. I was happy to find a CF card inside, and luckily I had a Kingston CF card in my bag (I knew that would come in handy someday). Swapped the card and put the unit together again.

Power on and issued the install command again. This time with success.

The install of Junos takes some time, a long 20 minutes. But then you also get a very nice login prompt. Logged in as root with no password. Went into CLI configuration mode and did a “delete” to wipe the factory config, then loaded the backup configuration with

load overwrite terminal

Pasted the 55kb JSON config into the console and finished with a ctrl+d followed by a commit. The commit succeeded and the whole network was suddenly alive again.

Just to make all the LEDs on the SRX green, I also wrote the config to the rescue config. This is done in operational mode.

request system configuration rescue save

A happy customer, and hopefully a new Juniper SRX1500 firewall on its way to relieve the SRX550 of its duties.

PowerCLI – View host HA status

Had a minor problem with a host where the HA agent could not be configured after a vCenter update, 6.5 build 15000 to build 21000. It was the only host in the cluster that had the error.

Tried:
– putting the host in and out of maintenance mode, and moving the host out of and back into the cluster. Did not help.
– disabling and enabling HA on cluster level, which worked for all the other hosts, but not my stubborn one.

Reading VMware KB 2056299 told me to manually uninstall the HA VIB (vmware-fdm) with

esxcli software vib remove -n vmware-fdm

After a successful uninstall I took the host out of maintenance mode and did a disable/enable of HA on cluster level, and voilà, it now works.

GUI is always a bit slow to update, but with PowerCLI you get current status.

PS C:\> $clusterName = "Cluster1"
PS C:\> Get-Cluster -Name $clusterName | Get-VMHost | Select Name,@{N='State';E={$_.ExtensionData.Runtime.DasHostState.State}}

Import certificate to NSX Edge

Normally when I get a certificate from a customer I get it in PFX format, but NSX Edge wants it in PEM format. What is often confusing here is that when converting the PFX, the private key comes out in PKCS8 format, but the Edge wants the private key in PKCS1 format.

Here is a write-up of the conversion. You will need OpenSSL on the machine that you work on; Windows, UNIX or macOS doesn’t matter.

First, we need to split the PFX into .crt and .key with these two commands

openssl pkcs12 -in [yourfile.pfx] -nocerts -out [private.key]
openssl pkcs12 -in [yourfile.pfx] -clcerts -nokeys -out [certificate.crt]

Now we need to convert the private key (saved here as private_pkcs8.key) from PKCS8 to PKCS1 format with this command

openssl rsa -in private_pkcs8.key -out private_pkcs1.key

Now you can go to your NSX Edge and import the certificate with the .crt and private_pkcs1.key files
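
Since the two key formats are easy to mix up, a quick check of the key file before importing can save a failed attempt. The format is visible in the BEGIN line of the PEM file, and the Python sketch below simply looks for it; the function name is my own. As far as I know, OpenSSL 3.x writes PKCS8 from the openssl rsa command unless you add -traditional, so the check is worth doing on newer systems.

```python
def private_key_format(pem_text: str) -> str:
    """Classify a PEM private key by its BEGIN header line."""
    headers = {
        "-----BEGIN RSA PRIVATE KEY-----": "PKCS#1",
        "-----BEGIN PRIVATE KEY-----": "PKCS#8",
        "-----BEGIN ENCRYPTED PRIVATE KEY-----": "PKCS#8 (encrypted)",
    }
    # Files exported from a PFX may start with "Bag Attributes" lines,
    # so every line is checked instead of only the first one.
    for line in pem_text.splitlines():
        label = headers.get(line.strip())
        if label:
            return label
    return "unknown"
```

If the function reports PKCS#1 for private_pkcs1.key, the file has the header the Edge expects.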

Later on, I found that I need to import the certificate together with the intermediate certificate of the signing third party. In my case it's GoDaddy.
To do this, we first convert the intermediate certificates to PEM.

.\openssl pkcs7 -print_certs -in gd-g2_iis_intermediates.p7b -out gd-interm.pem

Then we convert the cert to PEM and put the two certificates in the same file.

openssl x509 -in hk-domain.dk.crt -out hk-domain.dk.pem -outform PEM
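
Putting the certificates in the same file is plain concatenation, but the output of pkcs7 -print_certs also contains subject= and issuer= lines between the certificates. The Python sketch below keeps only the certificate blocks and writes the server certificate first, followed by the intermediates. That order is the usual convention and an assumption on my part about what the Edge expects; the function name is my own.

```python
import re

# One PEM certificate block, from its BEGIN line to its END line.
CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def build_chain(server_pem: str, intermediates_pem: str) -> str:
    """Return the server certificate followed by the intermediates, blocks only."""
    blocks = CERT_RE.findall(server_pem) + CERT_RE.findall(intermediates_pem)
    if not blocks:
        raise ValueError("no PEM certificate blocks found")
    return "\n".join(blocks) + "\n"
```

Feed it the contents of hk-domain.dk.pem and gd-interm.pem, and the returned text is what gets pasted into the load balancer.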

Now you can paste this into the load balancer. There must be an easier way to do it, if you find one then please ping me.

Edit: Found a couple of other OpenSSL commands that I from time to time struggle to find.

# Write an unencrypted copy of a private key (removes the passphrase):
openssl rsa -in file1.key -out file2.key
# Convert pkcs7 to pem format:
openssl pkcs7 -print_certs -in certificate.p7b -out certificate.cer