wiki:UploadingFiles
Last modified on 05/16/12 14:41:41

Getting data into Galaxy

There are several ways to get your data into Galaxy:

  • Browser based upload
  • 'FTP' upload
  • Data Libraries
    • Adding data as an administrator
    • Adding data as a non-admin user
  • Uploading using the API - I haven't explored this in detail yet, but feel free to try it out in a personal Galaxy instance. The API is disabled on galaxy.uabgrid.uab.edu, and even if it were enabled you would not be able to access it without passing external/shib authentication. It is possible to go through external authentication using certificates, or to bypass external auth for the API entirely, but I will cover that at a later stage.

Browser upload

Use this option only for small files, up to roughly 1 GB in size.
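As a rough sketch of applying the size guidance above, the helper below suggests an upload route for a given file. The function name `suggest_upload_method` and the exact 1 GiB cutoff are assumptions made for illustration; the page only gives an approximate limit.

```shell
# Suggest an upload route based on file size.
# LIMIT_BYTES is an assumed cutoff (1 GiB) based on the "~1GB" guidance above.
LIMIT_BYTES=$((1024 * 1024 * 1024))

# suggest_upload_method is a hypothetical helper, not part of Galaxy.
suggest_upload_method() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "not-a-file"
        return 1
    fi
    # wc -c reports the size in bytes; tr strips padding that some systems add
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -le "$LIMIT_BYTES" ]; then
        echo "browser"
    else
        echo "scp"
    fi
}
```

For example, `suggest_upload_method reads.fastq` prints `browser` for a small file and `scp` for one above the cutoff.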

'FTP' upload

Although this section uses the word 'FTP', we will actually transfer files using SCP. The Galaxy server can treat a directory on the filesystem (say, ftp_upload_dir) as if it were controlled by an FTP server. This allows Galaxy to make certain assumptions about the directory structure and about file handling. For example, Galaxy deletes a file from the ftp_upload_dir directory once its import is complete. However, Galaxy does not care how files are deposited in this directory. Hence, we can use SCP to transfer files to the directory and configure Galaxy to use it through its ftp_upload_dir setting. The directory is currently configured as '/lustre/importfs/galaxy/$USER'; it is not set up automatically for all users.
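A minimal sketch of depositing files with scp might look like the following. The host name `cheaha.example.org` is a placeholder and the function `upload_to_import_dir` is hypothetical; substitute the SSH login host you normally use. The sketch also assumes your remote account name is the one used in the import path. By default it only prints the command so that it can be reviewed first.

```shell
# Copy files into the Galaxy 'FTP' import directory over SCP.
# REMOTE_HOST is a placeholder (assumption): replace it with your SSH login host.
REMOTE_HOST=${REMOTE_HOST:-cheaha.example.org}
# Import location as quoted on this page.
IMPORT_BASE=/lustre/importfs/galaxy

# upload_to_import_dir USER FILE... (hypothetical helper)
upload_to_import_dir() {
    remote_user=$1
    shift
    dest="${remote_user}@${REMOTE_HOST}:${IMPORT_BASE}/${remote_user}/"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: print the command instead of running it
        echo "scp $* $dest"
    else
        scp "$@" "$dest"
    fi
}
```

Running `upload_to_import_dir "$USER" reads.fastq` prints the scp command; setting `DRY_RUN=0` performs the transfer.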

Uploading files to a data library

Galaxy can be configured to import files from a particular directory. In our Galaxy instance it is configured to use each user's scratch directory for file input, so users can upload files from their scratch directories to data libraries. The following steps walk through uploading files from a user's scratch directory to a data library.

Getting access to a data library

First, ensure that you have a data library to work with. See DataLibraries to request a new data library or to get access to an existing one. You will need permission to access the data library and to add items to it.

Uploading datasets to cheaha

  • Create a sub-directory in your /lustre/scratch/$USER directory for depositing datasets, say datasets-111.
  • Deposit files in this 'datasets-111' directory using scp.
  • Modify permissions so that galaxy can access the directories and read the files:
    # Use the setfacl command to grant read-execute permission to the galaxy user
    
    # Grant permission on the scratch directory - this allows galaxy to list all of its subdirectories on the data library's upload page
    $ setfacl -m u:galaxy:rx /lustre/scratch/$USER
    
    # Grant permission recursively on a specific subdirectory - this allows galaxy to access all files within it
    $ setfacl -R -m u:galaxy:rx /lustre/scratch/$USER/datasets-111
    
    
  • To remove galaxy's access to your scratch directory, use the following command:
    $ setfacl -x u:galaxy /lustre/scratch/$USER
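The directory and permission steps above can be gathered into a small helper that prints the commands for review before you run them. This is only a sketch: the function name `print_share_commands` is hypothetical, and the final `getfacl` line is an added suggestion for checking the resulting ACL rather than a step from this page.

```shell
# Print the commands that share one scratch sub-directory with the galaxy user.
# The paths and the 'galaxy' account name are taken from the steps above.
SCRATCH_BASE=/lustre/scratch

# print_share_commands USER SUBDIR (hypothetical helper)
print_share_commands() {
    user=$1
    subdir=$2
    top="$SCRATCH_BASE/$user"
    echo "mkdir -p $top/$subdir"
    # Let galaxy list the scratch directory itself
    echo "setfacl -m u:galaxy:rx $top"
    # Let galaxy read everything inside the dataset sub-directory
    echo "setfacl -R -m u:galaxy:rx $top/$subdir"
    # getfacl shows the ACL entries so the grant can be verified
    echo "getfacl $top/$subdir"
}
```

Calling `print_share_commands "$USER" datasets-111` prints four commands, which can be inspected and then run one at a time.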
    

Importing datasets into Galaxy

  • Click on 'Shared Data -> Data Library -> my-datalib'.
  • Click on 'Add datasets' and then select 'Upload directory of files' in upload options.
  • Select the dataset sub-directory that you want to upload - 'datasets-111'.
  • Upload to library.

Notes

  • Galaxy copies everything found in the datasets-111 directory, including sub-directories.
  • If multiple files are zipped together in a single zip file, Galaxy selects only one file out of the entire collection.
  • If we copy files into Galaxy, we should probably delete the originals in the scratch directory afterwards.
  • If we create links to these files without copying them into Galaxy, we should not delete them blindly.
  • If we use the 'create link - Link to files without copying in galaxy' option with a zipped file, Galaxy does not uncompress it, and it does not report an error either.
  • Both ftp_upload_dir and user_library_import_dir currently point to the same location, which may cause problems if users are not careful. If a user links to files through the data library options and then also imports the same files through FTP, the FTP import deletes the files and the data library is left in an error state. The user does not lose the files in this case, but the result can be confusing.
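Because zipped collections and linked zip files are not handled the way one might expect, it can help to check a dataset directory for zip archives before importing it. The sketch below lists them so they can be unpacked by hand first; the function name `list_zip_archives` is hypothetical.

```shell
# List .zip archives under a dataset directory, one path per line.
# list_zip_archives DIR (hypothetical helper)
list_zip_archives() {
    dir=$1
    # -type f skips directories; sorting in the C locale gives a stable order
    find "$dir" -type f -name '*.zip' | LC_ALL=C sort
}
```

For example, `list_zip_archives /lustre/scratch/$USER/datasets-111` prints nothing when the directory contains no zip archives.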

Choosing between FTP and Data Library upload options

  • Configured location
    • FTP: /lustre/importfs/galaxy/$USER
    • Data Library: /lustre/scratch/$USER
  • Depositing files
    • FTP: Any tool that can work over ssh, e.g. scp. Or fetch data from the internet using wget/curl on cheaha
    • Data Library: Any tool that can work over ssh, e.g. scp. Or fetch data from the internet using wget/curl on cheaha
  • Does it automatically get into galaxy?
    • FTP: No, requires the additional step of 'Upload data' using the web interface
    • Data Library: No, requires the additional steps of getting access to a data library and adding files to the data library
  • Does galaxy delete these files once they are imported?
    • FTP: Yes
    • Data Library: No
  • Possible to 'link' files without actually copying them into galaxy?
    • FTP: No
    • Data Library: Yes
  • Does it require any special sub-dir structure?
    • FTP: No - galaxy will scan the entire scratch directory recursively and list all files in the web interface
    • Data Library: Yes - it works only with sub-dirs, so even a single file that you wish to import should be in a sub-dir
  • Filesystem permissions
    • FTP: Galaxy needs full (rwx) permissions on your scratch directory - write permission is needed to delete files after import
    • Data Library: Galaxy needs read-execute (rx) permissions on your scratch directory and on the sub-directory you want to import
  • Suitable for
    • FTP: When you need instant results or one-time use
    • Data Library: Long-term use and sharing - you can deposit more data, share it with others, and label/describe datasets in a library