
Community Forum > Snapshots - LUN assignment > 255 "bug" - no access from ESXi

Hi again....

I was playing with the very nice StorageVolume snapshot features... really a great feature!
I have configured some hourly/daily/weekly snapshots just to see how it works, what kind of changes there are over such time periods, etc.
Yesterday, after quite some time not doing this, I tried to mount an older volume snapshot to get back some data I had managed to destroy.

I have two ESXi hosts, connected to QS via FC.
I assigned the snapshot to the hosts and at first all seemed "like usual"... but my ESXi hosts (v5.5) did not see any "new" drives...
After a little poking around I noticed the snapshot volume was assigned the FC SCSI LUN 527 - a little "strange" LUN ID I would say...

OK, after some more peeking and digging I finally found some answers...
The VMware documentation says the max LUN ID supported is 255, and I also found some other references and tests confirming that: http://sflanders.net/2011/10/02/esxi-lun-id-maximum/

So.... is there a way to change the LUN assignment for the volumes? I know your documentation says "LUNs are assigned automatically" but.... that assignment is not really "useful" it seems.
There is no problem with the iSCSI access as there the LUN is always 0... but for FC... it's a real problem...
Is it really possible that no one noticed this before??

QS version 4.2.0.375

Any suggestion?

Best regards,
M.Culibrk

March 12, 2017 | Registered Commenter M.Culibrk

Hello,

In this case, you should be able to use the LUN ID with the latest version of vSphere, which supports LUN IDs up to 1023. If you have a specific need beyond this, more information about the issue you are having would help us understand it better.
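
As a rough illustration of the two limits mentioned in this thread (255 as reported for ESXi 5.5 in the first post, 1023 per the reply above), here is a minimal Python sketch. The helper name `unreachable_luns` is made up for this example, and the LUN values are taken from the posts in this thread; it only shows which LUN IDs would fall outside a host's supported range.

```python
def unreachable_luns(assigned_luns, max_lun_id):
    """Return the LUN IDs that exceed the host's highest supported LUN ID."""
    return sorted(lun for lun in assigned_luns if lun > max_lun_id)


# LUN values mentioned in this thread
luns = [2, 34, 527, 1684]

print(unreachable_luns(luns, 255))   # [527, 1684]
print(unreachable_luns(luns, 1023))  # [1684]
```

Note that a higher host limit alone would not cover a LUN such as 1684.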

Also, would you be able to clarify if this is for the free community edition? Otherwise, are you looking to consult or become a partner and have direct support access? If so, you can contact Sales Engineering at sdr@osnexus.com if you have a particular customer and pending sale up for discussion.

Thanks

March 13, 2017 | Registered Commenter Aaron Knodel

Hi M,
Just to add to Aaron's feedback, I think that we could be doing a better job in picking FC LUN numbers. With iSCSI we are always using LUN 0 because each volume and snapshot has a unique IQN with the device on LUN 0. With FC the model is different and we have to account for ALUA environments where the same volume is presented from multiple appliances and we want those to be exposed with the same LUN number. I think that the issue is that we're picking the LUN number too soon; essentially we should have "lazy" allocation of LUN numbers, done only once the storage is assigned. That would prevent these scenarios where you're seeing these high LUN numbers. Could you share the number of volumes and snapshots you have in your configuration?
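
To make the difference concrete, here is a minimal Python sketch of the two strategies. It is not QuantaStor code: the class names `EagerAllocator` and `LazyAllocator` and the `simulate` helper are invented for illustration, and starting the numbering at 1 is an assumption.

```python
import itertools


class EagerAllocator:
    """Consumes a LUN number the moment a volume or snapshot is created."""

    def __init__(self):
        self._counter = itertools.count(1)  # assumption: numbering starts at 1
        self.lun_of = {}

    def create(self, name):
        self.lun_of[name] = next(self._counter)

    def assign(self, name):
        return self.lun_of[name]


class LazyAllocator:
    """Defers picking a LUN number until the volume is assigned to a host."""

    def __init__(self):
        self.lun_of = {}

    def create(self, name):
        pass  # nothing is consumed at creation time

    def assign(self, name):
        if name not in self.lun_of:
            used = set(self.lun_of.values())
            self.lun_of[name] = next(n for n in itertools.count(1) if n not in used)
        return self.lun_of[name]


def simulate(allocator, snapshot_count):
    """Create many snapshots, then assign only the most recent one."""
    for i in range(snapshot_count):
        allocator.create(f"snap-{i}")
    return allocator.assign(f"snap-{snapshot_count - 1}")


print(simulate(EagerAllocator(), 500))  # 500
print(simulate(LazyAllocator(), 500))   # 1
```

With scheduled snapshots the eager counter keeps climbing even though almost none of the snapshots are ever presented to a host.
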
Best,
-Steve

April 3, 2017 | Registered Commenter Steve

Hi there!

Sorry for this really long response time....

The issue is exactly as you described it! You allocate LUN numbers for volumes immediately on creation instead of allocating them when assigned to some host.

I have something like 10 volumes active and a few snapshot schedules: the first one runs every 1-2 hours and keeps 3-4 snapshots, and the other runs daily, making a kind of "daily/weekly/monthly" backup snapshot.

LUN numbers are going really high right now... I think it's at approx 1750 right now...

example:

Compression Ratio 1.54x
Created By admin
Created Time Stamp Sun Oct 29 23:01:01 GMT+100 2017
Description Auto-generated by snapshot schedule 'Weekly'.
IQN iqn.2009-10.com.osnexus:ecf9ea6f-53c6794bd7079c10:LJ-01-SAS1.GMT20171029.220101
Internal ID 53c6794b-d707-9c10-6422-0e9e2566afbc
Internal Location /dev/zvol/qs-ecf9ea6f-975f-7738-1f12-d74accbc2de4/53c6794b-d707-9c10-6422-0e9e2566afbc
Is Snapshot? true
Is Thin Provisioned? true

LUN 1684

Modified By admin
Modified Time Stamp Sun Oct 29 23:01:01 GMT+100 2017
Name LJ-01-SAS1_GMT20171029_220101
Owner(s) admin

If you have any idea on how to solve this... it would be really, really appreciated!

Best regards,
M.Culibrk

November 17, 2017 | Registered Commenter M.Culibrk

Hi M,
Technically, you could edit the StorageVolume rows in our internal SQLite database under /var/opt/osnexus/quantastor/osn.db to re-order the LUN numbers. WARNING: This falls under the "at-your-own-risk" category, as you can easily mess up the database. You'd want to be sure that the service is stopped ('service quantastor stop') and that your clients are all disconnected, and you'd need to reboot the appliance after the change so that the target driver is reloaded. I'll check with engineering to see about getting this into the roadmap for v4.5 now that v4.4 has shipped so that you don't need to do hacky stuff like what I describe above.
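
Before changing anything, it may help to look at what the database actually contains. The schema of osn.db is not described in this thread, so the Python sketch below makes no assumptions about column names: it opens a SQLite file read-only and lists every table with its columns. The function name `describe_tables` is made up for this example, and it is safest to run it against a copy of the database rather than the live file.

```python
import sqlite3


def describe_tables(db_path):
    """Map each table name to its column names, opening the file read-only."""
    # mode=ro makes SQLite refuse any write through this connection
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        return {
            table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }
    finally:
        conn.close()
```

For example, `describe_tables("/tmp/osn-copy.db")` (a hypothetical copy location) would show which columns the StorageVolume table has before any manual edit is attempted.
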
Best,
-Steve

November 17, 2017 | Registered Commenter Steve

Thanks for the quick response!

Yeah... I already kinda did what you suggested... some time ago when I really needed to access a snapshot for recovery... and it worked... and saved my day (or better a few nights).

Anyway... the thing that I do not get/understand is.. how come no one else noticed/hit that problem??
It's kind of scary when you "count" on the snapshots being there, and all seems right, till the moment you actually need/try to use the volume/lun... no errors or notifications of any kind on any side... just "the lun is not there to be seen". Panic!
But, after a few "tranquilizer" shots and a few deeep breaths things slowly unhide from the shadows... :)

So, thanks for all the help & support, and a great piece of SW!

Regards,
M.Culibrk

November 18, 2017 | Registered Commenter M.Culibrk

M, thanks for the feedback on this. We're going to address this in v4.4.1 as a hot fix. The plan is to make the LUN numbers dynamically assigned at the point in time a Storage Volume is assigned to a Host/Host Group. Unassignment and/or deletion of a Storage Volume will return the LUN number to the available list so that it may be reused. Further, we're looking at the option to assign static LUN numbers so that they can be made sticky between assign/unassign operations.
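
The plan above can be sketched in a few lines of Python. This is only an illustration of the idea, not the v4.4.1 implementation: the `LunPool` class, its method names, and the choice to hand out the lowest free number first are all assumptions.

```python
import heapq


class LunPool:
    """Tracks free LUN numbers; the lowest free number is handed out first."""

    def __init__(self, max_lun):
        self._free = list(range(1, max_lun + 1))  # assumption: LUNs start at 1
        heapq.heapify(self._free)
        self._assigned = {}  # volume name -> LUN number

    def assign(self, volume):
        """Give the volume a LUN at assignment time (not at creation time)."""
        if volume in self._assigned:
            return self._assigned[volume]
        if not self._free:
            raise RuntimeError("no free LUN numbers left")
        lun = heapq.heappop(self._free)
        self._assigned[volume] = lun
        return lun

    def release(self, volume):
        """Unassign or delete: return the LUN number to the available list."""
        lun = self._assigned.pop(volume, None)
        if lun is not None:
            heapq.heappush(self._free, lun)


pool = LunPool(max_lun=255)
pool.assign("vol-a")   # gets LUN 1
pool.assign("snap-1")  # gets LUN 2
pool.release("vol-a")  # LUN 1 becomes available again
print(pool.assign("snap-2"))  # 1
```
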
Best,
-Steve

November 20, 2017 | Registered Commenter Steve

M, we've fixed the issue in v4.4.1 and we're just finishing up the testing. Look for the new QuantaStor v4.4.1 release on Monday which will address this.
Best,
-Steve

December 15, 2017 | Registered Commenter Steve

Hi there again!

First of all, let me congratulate you on all the improvements and new versions you have published! Fantastic job!

...but, after some "don't touch don't worry" months I'm again encountering FC LUN assignment problems. This time "is even worse"...

Some time ago I updated my box to version 4.5.2.001 and did not "touch" any LUNs and/or snapshots... but now I really need to access an old(er) snapshot to retrieve some "lost data".
I saw many changes/improvements in LUN handling, and the new "remove/free unused LUNs"... which seems great... BUT...

I now assign a snapshot to a host, QS then assigns a new, low (it got LUN 1), supposedly unused/free LUN to the volume... but I see nothing on the host side. Then I started to look around and check volumes and LUN assignment...

WTF? Now I have duplicate LUNs assigned???!?! how is that possible?
Hopefully nothing really bad happened - the "right" (old) volume, which had its LUN assigned waaay back, is still accessible, and the new snapshot is not, BUT... it was a "heart stopping" moment...

Whatever I do, I cannot make QS assign another, really unused LUN to the newly "assigned" volume... I tried to change the LUN value in the SQLite database, but it gets overwritten when the volume is touched from the QS admin interface...
I simply can't restart the server to make it "understand", as it's being actively used. I stopped the quantastor, tomcat and qs_restd services beforehand, but it did not help.

What would be the "right" way of assigning/changing this "smart" auto assignment of LUN ID to the volume? ...just don't tell me "restart the server"...

Any help is greatly appreciated.

Regards,
M.Culibrk

June 22, 2018 | Registered Commenter M.Culibrk

QuantaStor has a few modes of operation for LUN numbering and one of those allows pinning the number so it never changes. With HA pools the numbers are selected from the unused LUNs taking into account both the pool primary and secondary.
My guess is that you have a pinned LUN on the snapshot and so it is unable to present it. It should have reported an error for that.
Please send logs to support so that we can get a closer look. There is a LUN assignment policy you can adjust at the command line via volume-modify iirc.
Best,
Steve

June 22, 2018 | Registered Commenter Steve

Thanks for a really quick answer (as always)!

I think you misunderstood my issue.
I have several volumes already active/assigned and "production" working... then I have a snapshot schedule set to take snapshots of those volumes from time to time.
Now I want to make one of these snapshots available to a host to get some "older" data back.
When I assign host access to the snapshot volume QS assigns an already used and active LUN number to this volume.

For example, I have 3 volumes active with LUNs 2, 34, 62. I now assign host access to one of the snapshots and QS assigns LUN 2 to it (instead of any of the unused IDs in 1..1023).

Hopefully this "assignment" is not breaking the already active (and actively used) LUN 2 but just prevents the "newly assigned volume" from being accessible.

There are no HA pools or anything like that, just plain FC access.

You mention

There is a LUN assignment policy you can adjust at the command line via volume-modify iirc

Can you share some more info about that? I'm unable to find any LUN assignment/pinning commands in the qs*** CLI tools or in the GUI.

Thanks for all your help and time!

Best regards,
M.Culibrk

June 26, 2018 | Registered Commenter M.Culibrk

Hi M,
QuantaStor assigns LUN numbers based on the next available number for the specified Host. For example, Host A could have volumes V1 and V2 with LUNs 1 and 2 respectively, and Host B could have volumes V3 and V4, also with LUNs 1 and 2 respectively. We could take the strategy that a LUN number is used uniquely and forever for a given volume, but that would lead to large gaps in LUN numbers, which creates problems for various operating systems. So the key bit is that the LUN numbers are relative to the host that they're assigned to, for optimal packing. Operating systems (Linux, Windows, VMware, etc.) all use the SCSI page 0x83 device descriptor to uniquely identify devices, so the LUN number itself is inconsequential. To identify which device is which, we use the Storage Volume UUID as part of the page 0x83 identifier. On Linux systems you can get this by running commands like sg_inq /dev/sdN, but better yet is to just use the /dev/disk/by-id/scsi-N devices, as then you don't ever need to think about paths under /dev/sdN, which can move.
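
The Host A / Host B example can be reproduced with a small Python sketch. This is an illustration of per-host numbering only; the `HostLunMap` class is invented here and does not reflect how QuantaStor stores its assignments.

```python
from collections import defaultdict


class HostLunMap:
    """Keeps a separate LUN number space for every host."""

    def __init__(self):
        self._by_host = defaultdict(dict)  # host -> {volume: lun}

    def assign(self, host, volume):
        """Return the volume's LUN on this host, picking the lowest free one."""
        luns = self._by_host[host]
        if volume not in luns:
            used = set(luns.values())
            lun = 1  # assumption: numbering starts at 1
            while lun in used:
                lun += 1
            luns[volume] = lun
        return luns[volume]


m = HostLunMap()
print(m.assign("host-a", "V1"), m.assign("host-a", "V2"))  # 1 2
print(m.assign("host-b", "V3"), m.assign("host-b", "V4"))  # 1 2
```

The same LUN number appears on both hosts without conflict because each host only sees its own mapping.
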
In the default dynamic mode, when a Storage Volume is assigned to a Host for the first time it will try to use the LUN number on the Storage Volume as the preferred LUN, but if that's already taken it'll allocate a new one. If you have static assignment and the LUN number is in use, then the LUN cannot be mapped out, so we put a warning message into the log. This is why static mode shouldn't be used unless there's a really specific use case that would require it.
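
A minimal Python sketch of that decision, assuming a lowest-free-number fallback in dynamic mode. The function name `resolve_lun`, the mode strings, and the log message are made up for illustration; the "used" set is taken from the 2, 34, 62 example earlier in this thread.

```python
import logging

log = logging.getLogger("lun-sketch")


def resolve_lun(preferred, used_on_host, mode="dynamic"):
    """Pick a LUN for a volume on one host; return None if it cannot be mapped."""
    if preferred not in used_on_host:
        return preferred  # the number stored on the volume is free: keep it
    if mode == "static":
        # static mode never renumbers, so the volume stays unmapped
        log.warning("LUN %d already in use; volume not mapped", preferred)
        return None
    lun = 1  # dynamic mode: fall back to the lowest unused number
    while lun in used_on_host:
        lun += 1
    return lun


used = {2, 34, 62}
print(resolve_lun(2, used))            # 1
print(resolve_lun(2, used, "static"))  # None
print(resolve_lun(5, used))            # 5
```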

There is some ability to statically assign a specific LUN number to a Storage Volume using the QuantaStor CLI. IIRC "qs volume-modify" has some arguments for choosing static or dynamic LUN assignment mode. The static mode is not preferred as it can lead to the gaps in LUN numbers problem that you ran into before where they were just always increasing.

With QuantaStor HA Pools we also take into account the LUN numbers assigned for a given Storage Volume from the perspective of both a Pool's primary and secondary nodes for failover, to avoid picking a LUN number which is in use. At times users will want to assign and unassign Storage Volumes, so the LUN number is kept on the Storage Volume even in Dynamic mode, but the system has the right to change it as needed in that mode in case the number is taken.

With regard to iSCSI, QuantaStor uses LUN 0 for all Storage Volumes including snapshots since each one gets a unique IQN associated with it.

One last bit: if you try to access a snapshot from VMware by mistake, VMware will not import the LUN unless you explicitly tell it to. This is because VMware doesn't care what LUN number is assigned to a device; what it cares about is two IDs: the UUID it writes to the device to identify the Data Store, and the SCSI 0x83 Device Descriptor, which maps to our Storage Volume UUID. The snapshot has a different UUID and VMware detects that and essentially says "Hey, I see a device here with the DataStore on it but this isn't the original, this is a clone or snapshot, what do you want to do with it?".

Hope that helps and sorry it took so long to get back to you M.
Steve

September 1, 2018 | Registered Commenter Steve