Wednesday, 17 May 2017

Cancer Uk

>> good afternoon and thank you for joining us today for the nci speaker series. i'm eric stahlberg, director of the hpc initiative at the frederick national laboratory for cancer research. as a reminder, today's presentation is being recorded and will be available via the cbiit website at cbiit.nci.nih.gov. you can find information about future speakers on that site and by following us on twitter at @nci_ncip. today we are happy to welcome dr. ian foster,

director of the computation institute, joining us from the university of chicago and argonne national laboratory. ian is also an argonne senior scientist in math and computer science and a distinguished fellow. the title of his presentation is, streamlined data, excuse me, streamlined transfer and sharing of large-scale sensitive data to advance cancer research. with that i'll turn the floor over to dr. foster. >> ian foster: thank you, eric.

thanks everyone for coming. i was told that you normally sit in your offices and watch this via webex, so i appreciate you being here in person. i'm not going to talk too much about cancer research today, but i am going to talk about something that i hope you'll find relevant. we have been developing technologies for some time to streamline, as i say in the title, the sharing of data and, therefore, the analysis of that data. i know you're all busy on your [inaudible] and data sharing and analysis is one of the foci of that input, you know,

and i'd like to propose a thesis: people basically want to share data in many cases, and what hinders them doing so is the friction associated with moving data from a to b and making it available to others. so we have been developing technologies that address some of those issues, and i'll explain how they work. first of all i thought i'd try an analogy; you can see how it works. this is a picture from wikipedia. it depicts a somewhat famous expedition,

it involved one [inaudible] eisenhower back in 1919. the army decided they would try and drive across the united states to see how long it took, and it took them 56 days, plus 6 days resting, [inaudible] to about 9 kilometers per hour, to get across the country because, of course, the road system was so primitive. moving forward a little bit, you can now, thanks to the interstate system that eisenhower, i think inspired by that experience, made occur, get across the country in 41 hours by road and, in fact,

you can do it in 5 hours, unless you're afraid to fly. [inaudible] long time relative to another way you might move something across the country. you might get online and use basically a cloud-based service called fedex and, you know, in a couple of minutes if you're prepared to type in some information, or in less than a second if you're prepared to use an api, you can initiate the movement of large amounts of material from one place to another. so, i'm not sure if the analogy works, but the point is that things that we used

to do ourselves sometimes very painfully we can now outsource and automate by handing off to cloud services. in fact, most of industry has been totally transformed over the last 10 years by various cloud-based services that allow you to automate and outsource a very wide variety of activities associated with running your daily life, your home life, your company. just to give a bit of structure to this discussion of cloud, remember that cloud comprises 3 components. we've got our infrastructure as a service,

which is where people who want to do high performance computing sometimes go to perform various sorts of data analysis. we've got our software as a service, which is where fedex and netflix and the other things we use on a daily basis live. and then there's this enabling thing that makes it easy to build software as a service, called platform as a service. so we'll come back to some of those categories a little bit later. so, we decided about 6 years ago to start answering the question: what could we take out of the scientific workflow

and deliver as cloud services, as software as a service? and, in fact, we used a range of science applications to study that question including questions relating to cancer research. so i'm going to put up a couple of slides that depict some cancer research projects. so this is a slide put together i think in a moment of frustration by dr. olufunmilayo olopade at the university of chicago. she was trying to explain to us where all the data that she was working with was sitting and the difficulties in moving it from one place to another.

i won't go through it in detail, but the point is she spends a lot of time, or rather her students and postdocs spend a lot of time, on apparently simple tasks: moving data from one place where it was located to another place where it could be computed on, and so on and so forth. a second example is from dr. lucy godley at the university of chicago. this shows the set of sites that participate in an international sequencing consortium, the runx1 consortium, which is looking at the genetic basis for inherited hematological malignancies.

and so here she has a similar problem to [inaudible], but now the scientists she's collaborating with are not within a single lab; they're located right around the world, as far away as australia and south america. so if you look at these sorts of applications, these problems, i think probably many of you have experienced this list of areas of friction. these are things that turn out to be time consuming, often slowing down research, sometimes hindering people from even taking on problems.

things like moving data rapidly from one place to another, accessing data at other labs, controlling who is allowed to access your data and tracking that access, discovering what data is available, computing on data at large scale, and so on and so forth. so, we observed these areas of friction and built out a system called globus that basically, well, you can think of it as fedex, but instead of initiating the transport of packages it initiates the movement of data from one place to another.

this is a little counter of how much data we've transferred. sort of fun to watch if you have some time on your hands; it's up to about 190,000 petabytes at the moment. so, i'll walk you very quickly through what it does and then we can go into some more interesting information, but this is the world as it looks to us. the world is made up of storage systems. the storage systems may be associated with an experimental facility, with a publication repository, a [inaudible] facility somewhere, this could be nih helix,

maybe your personal computer. those storage systems increasingly have globus installed on them. there are about 10,000 active endpoints across the us and internationally at the moment, where an active endpoint is one that is used regularly. so that's the world as we see it, and then this is a [inaudible]. they can come in and start using the web interface, or perhaps an api if they're a programmer, to initiate a transfer, and then globus manages it, moves their files, their 1 million files, their gigabyte or their terabyte, from the sequencing center to the compute facility.
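to give a concrete flavor of the api route, here is a sketch of roughly the kind of submission document a programmatic transfer request carries. the field names approximate the globus transfer api's transfer document, and the endpoint names and paths are made up; treat this as an illustration of the "name a source, a destination, and paths, then hand the job off" idea, not the exact schema:

```python
# sketch of the kind of request document a programmatic globus transfer
# submission contains. field names are approximations of the transfer api's
# submission document; endpoint names below are hypothetical.

def build_transfer_request(source_endpoint, dest_endpoint, items, label=""):
    """Build an illustrative transfer submission document."""
    return {
        "DATA_TYPE": "transfer",
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "label": label,
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
            }
            for src, dst in items
        ],
    }

request = build_transfer_request(
    "sequencing-center-endpoint",   # hypothetical endpoint names
    "compute-facility-endpoint",
    [("/seq/run42/sample.fastq", "/project/in/sample.fastq")],
    label="run42 to compute facility",
)
```

in practice the document would be submitted through the globus web interface, cli, or python sdk, and the service then manages the transfer on your behalf.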

and that's something you could do because [inaudible] runs a globus endpoint, nih helix does, amazon runs them. so there are all places you could move data in this manner, and it does it reliably. we were talking just before this about the challenges inherent in someone sending you 8 million files describing 100 patients and only 7,999,999 arrive. which is the missing file? how do you find that?
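the bookkeeping behind that question is essentially a manifest comparison, which globus automates for you (along with retries); a toy version of the check, scaled down from 8 million files to a thousand:

```python
# the "7,999,999 of 8 million files arrived" problem: with a manifest of
# expected files, finding the one that never arrived is a set difference.
# this sketch just shows why per-file tracking matters at scale.

def find_missing(expected, arrived):
    """Return the expected files that never arrived, in a stable order."""
    return sorted(set(expected) - set(arrived))

expected = [f"patient_{i:07d}.vcf" for i in range(1000)]  # stand-in for 8 million
arrived = expected[:500] + expected[501:]                  # one file lost in transit

assert find_missing(expected, arrived) == ["patient_0000500.vcf"]
```

the real service also verifies checksums and restarts failed transfers, but the core accounting is this simple set arithmetic done for you at scale.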

once you have moved data somewhere, you can then share it with other people. so, people are increasingly using this to provide data from scientific instruments to collaborators. you select the file, select who you want to share it with, off you go, [inaudible] and then you share it with them and they access it. just 2 more things i'm going to talk about on this slide. we've also more recently added support for publication workflows, because we observed that once people have produced some data they often want to then give it a digital identifier and give it a name,

put it into a repository along with the associated metadata and then allow other people to discover it and download it. so that, in a sense, is globus in one slide. the key points are that you only need a web browser to do all this stuff, you can have your data on any storage system, it could be on nih systems, university systems, fund [phonetic] systems, and you can access it using any credentials. so, university [inaudible] credentials, other similar things. okay. so how are we adding value?

it's very easy to use: data movement is now a click of a mouse. it's very reliable, [inaudible] important as data sizes grow. it's very easy to share data with others. we support all sorts of different security credentials. we optimize wide-area network performance, which becomes important if you are moving a terabyte or even a gigabyte. and it's very easy to deploy: if you wanted to install globus connect on your computer,

you could do it in a few minutes, a few seconds even. you could try it and tell us how long it takes you. and then, we'll come back to this, but it's highly automatable. most people use the web interface to access it, but you can build it into applications, and that ends up being very powerful. okay, so i'll just show you a few screenshots. here i'm logging into the globus website using my university of chicago credential. i get sent back to chicago to authenticate using my,

actually my colleague's credential. once i have done that i get asked what permissions i want to give to globus on behalf of this credential. this is the sort of thing you might be used to if you use a facebook or google credential for authentication. so here i'm saying i want to allow this web app to transfer files, view my identities and manage groups. we'll come back to what those are in a second. i can combine multiple identities together.

so in the [inaudible], a lot of resources i access using my university credentials, but then sometimes i'll access systems at a national facility, which has a different sort of authentication scheme. so i combine those together and globus will use the appropriate one as required. and then i can also use these [inaudible] to authenticate to other things. for example, the national science foundation's xsede national supercomputer network uses globus to authenticate.

the canadian compute canada system uses globus to authenticate. so, there's a growing number of things you can connect directly to using this system. i mentioned sharing, so let me just say a few words about that. say i've got a directory on my computer and i want to share it with a colleague at another institution, or even in the same institution. how would i do that? normally, i guess, i could put it on a cd-rom.

i could get them an account on my computer. i could move it to something like dropbox, but maybe it's too large for that. or, with globus, i can simply click on that directory and set permissions to allow my colleague to access it, and then they can come in and read it and, if i give them permission to write, write the data. so sharing becomes trivial. we'll skip that. so, i mentioned before our view of the world.
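the sharing model just described, granting a chosen identity read or read/write permission on a directory, can be sketched as a tiny access-control check. this is a toy model of the idea only; globus's real acl system is richer than this:

```python
# minimal sketch of directory sharing: an access rule grants an identity
# read ("r") or read/write ("rw") permission under a path prefix, and a
# request is allowed only if some rule covers it.

from dataclasses import dataclass

@dataclass
class AccessRule:
    identity: str      # e.g. a university or google identity
    path_prefix: str   # directory being shared
    permissions: str   # "r" or "rw"

def is_allowed(rules, identity, path, action):
    """action is 'read' or 'write'; True if any rule permits it."""
    needed = "r" if action == "read" else "w"
    return any(
        r.identity == identity
        and path.startswith(r.path_prefix)
        and needed in r.permissions
        for r in rules
    )

# share /data/shared/ read-only with one (hypothetical) colleague
rules = [AccessRule("colleague@example.edu", "/data/shared/", "r")]

assert is_allowed(rules, "colleague@example.edu", "/data/shared/runx1.csv", "read")
assert not is_allowed(rules, "colleague@example.edu", "/data/shared/runx1.csv", "write")
```

the point of the design is that the data never has to move to be shared; only a rule like this is created on the endpoint.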

we have this data fabric of storage systems around the country which are globus-enabled, which means i can transfer to and from them, and share on them if i have permission. so, what are these storage systems? well, we have storage connectors, as we call them, for all sorts of systems: linux, windows, macos, of course, various forms of high performance storage, but also less conventional things like the high performance storage system, which is a [inaudible] data management system.

hdfs, which is often used for map-reduce parallel computation. amazon s3 storage. various other specialized ones like ceph and spectra logic. google drive, which a growing number of universities are using to provide storage to their researchers and students. so you can access these all in exactly the same way; you don't need to be aware of the fact that you're accessing the cloud here, a [inaudible] here and another system somewhere else. i mentioned performance.

so some people still use secure copy, scp, to move data. so this is a transfer across the country, not quite as far as eisenhower went but similar, from chicago to berkeley. and we're getting up to about 8 gigabits per second here using gridftp, versus using conventional ftp [inaudible]. we're getting close here to the 10 gigabit peak of this [inaudible]. okay. i want to say just a few words about our publication service because this may be of interest to some. so, as i said, once you've got data somewhere,

what might you want to do with it? you might want to share it informally with someone, which is what the various sites that deliver data to people may want to do. or you may want to publish it, by which we mean give it a long-term identifier. one second here. i think the computer [inaudible] is lost. this is my [inaudible]. yes, you have a question? [ inaudible ]

>> ian foster: that is an excellent question. so, when i talk about a national data fabric, we i think are an important part of it and allow you to move data over it, but another important component, which we see being increasingly deployed at major research universities, are the so-called science dmzs. these are systems that are set up to sit outside the corporate firewall, connected directly to internet2 or other networks, and they typically provide 10 or more gigabits per second of performance. now we do very well if we're running over 100 megabits

or a gigabit, but we [inaudible]. so i think a national initiative and [inaudible] of science needs to not only provide access to data, it needs to provide access to networks. the national science foundation has put a lot of money into subsidizing the [inaudible] to universities, and so there are quite a few of them, but probably more are needed. okay, so we're back online. just a few words about data publication. so, we've got the data repository as part

of the bd2k center we're involved in, the bdds data repository, and this will allow anyone who wants to upload data to get assigned a digital object identifier so other people can search for and discover the data. so far we don't have much data in this biomedical one; it's just a test. we have more in another one that we're operating for another community, the materials community, which is [inaudible]. you can see the interfaces.

you describe your dataset, provide your data, provide standard metadata, describe where you want to publish it and so forth. okay, and you can search. here i'm searching for liquid-solid metallic mixtures, which you've probably never done. but you could also be searching, of course, for [inaudible] data and so forth. okay, so i just want to finish this part of the talk by mentioning that globus is widely used.

so, i talked about 100,000, sorry, 10,000 endpoints; there are 50,000 registered users and we've moved about 20 billion files. we've done some other fun things. so we run this as a cloud service: the data that you move moves directly from your lab to your collaborator's lab, while the management logic that keeps track of it runs on amazon computers, and that allows us to provide very high availability because we replicate it across multiple amazon availability zones, as they're called. so we've managed to achieve at least [inaudible] over the last several years.

so that means, you know, it may be down for a few tens of minutes over a year-long period. >> may i ask a question? >> ian foster: yeah. >> so is the data actually going on amazon cloud? is it the software or is it [inaudible] software? and it's actually installed on your own [inaudible]? >> ian foster: yes. so if we were to go back just for a second to this picture.

>> would you mind repeating the question? >> ian foster: yeah, so the question was: when someone here uses the globus web interface to request that data be moved from, say, a sequencing center to a compute facility, what is running where? what data moves where? so the data moves directly, over whatever [inaudible] network is available, from the sequencing center to the compute center. the globus software is running on the amazon cloud, keeping track of the [inaudible] of that transfer, which files have been sent,

which have not, and being ready to [inaudible] if transfers fail. >> at no point is the data actually coming in [inaudible]? >> ian foster: that's right. now, you could set up a globus endpoint on aws and then move data there, but that would be a [inaudible]. that's important for security. so what are we up to now? one thing we do is keep track of usage of the cloud service, so we know who is doing what with the service, and we send the subscribers,

i'll say more about what subscribers are in a minute, information on how the system is being used. so this is information on globus usage at nih. there are a number of different endpoints; the helix system is the biggest, but there are also others. so you can see nih has been using it since early 2015 and it's been moving up to more than 100 terabytes per month, and we're now up to over 120 users within the intramural system at nih. and we can also see where this is occurring.

so, this is helix, the biggest in terms of users and usage, but there are a number of other systems that are starting to scale up as well. i don't actually know what the others are, except that they're within the nih domain. so, we're starting to see significant usage of the system within the nih intramural program. i mentioned the word subscription, so let me say a few words about this because i think it's a very important part of what we're doing. we're all based at the university of chicago.

we operate this system for the [inaudible] community, and we aspire to do that [inaudible], i guess that's a long time, but as long as people want to keep using the system. one way we have pursued that goal of sustainability is through a subscription system. so, anyone can move data between a and b, but if you want to do things like manage sharing then you need a subscription. nih has a subscription, the university of chicago does, and roughly 40 institutions in the us

and overseas have subscriptions. those who have subscriptions get some additional capabilities: shared endpoints, data publication, the management console. so this is showing transfers that are currently underway at lawrence berkeley lab's national energy research scientific computing center. you see a couple of very large transfers and some small ones occurring at that time. and there are a few other capabilities.

so, subscriptions end up being, i think, one of the innovative approaches we're taking to deliver research data management services sustainably. this is the identity of some of the current subscribers to the globus system. i'm trying to see if there are any surprising ones there. we have the wellcome trust in the uk; they're quite a big user, and i'll say more about that a little bit later. otherwise they are mostly major research universities and national labs. okay, so far i've talked about globus as software as a service.

so it runs on amazon; you use its web interface to go in and ask for things to happen: data to move, data to be shared, data to be published. but importantly, and this is going to get pretty geeky, we also have apis, ways of integrating those services into your applications. so globus can serve as a platform that allows you to build other sorts of capabilities on top of the services we provide. i'll give you some examples, but first of all i'll just point out that there's a whole separate part of the globus website with documentation for our apis, the transfer api, and there's a python sdk,

there are all sorts of jupyter notebooks showing how to use these things. we're going to be presenting, by the way, a workshop at nih in january on how to use these things. there's also a command-line interface for those who really want to get down to the low level, so you can write commands at the command line on your macos system to perform transfers. so, i was asked earlier about who has access to high speed networks, and this slide sort of speaks to that. most research universities now run what people call a science dmz.

so, the idea here is that when the internet was first deployed, it was ubiquitous, there were no barriers to its use, so you could move data in and out of any computer. then hackers appeared, so universities started deploying firewalls to protect their administrative data, and now suddenly you couldn't move data in and out of universities rapidly because of the firewall getting in the way. the science dmz idea is that you move some things outside your administrative firewall, for example the file systems that maintain your research data, and keep other systems behind the firewall, [inaudible] firewall,

and then build very high speed networks to things called data transfer nodes. this is a slide from esnet, the energy sciences network, and you'll see the little globus logos on it; they recommend globus be used to manage flows in and out of these things. and then your internal systems stay very secure, while the data can move very rapidly, and in a very controlled manner, in and out of your institution, because it runs over a separate path from your administrative data. so, globus leverages these science dmzs.

so when we come and talk to you in january, we'll talk about how this lets you build very powerful data portals that deliver data to communities very easily, because as we'll show in the tutorial you can build a fairly simple piece of code that has the very specific logic that decides who is allowed to access your data, how they search for the data, how they request it to be downloaded, and then you hand off to globus the job of delivering the data from your science dmz to wherever people want to access it from. and then data movement and security issues are handled by api calls

to globus services that are running on the cloud. people like the national center for atmospheric research use this to deliver all of their environmental data. yes? >> in this case, we're not touching amazon at all? >> ian foster: this is all running on amazon. yes. >> so you're still going outside the network. >> ian foster: yes, right.

but institutions typically don't mind you making outgoing calls to a cloud service, so that use is acceptable, and they don't mind an http message coming in to make a request. what they don't want is big data transfers going in and out through the firewall. so that seems to be, well, of course every institution's policy may be different, but this seems to be satisfactory in many situations. so here's one example that i just came across recently: the wellcome trust institute.

they run a little system where you can upload your [inaudible] data and get back a list of imputed genes. i've done it. it didn't make any sense to me, but it probably would to you. it looks like a regular website, but if you look here, you'll see that you log on with your globus identity, they use globus to upload the data, and they use globus to manage access to the computational results, so you get pointed to a website from which you can download the data.

that's all handled using this machinery here. so they implemented this part, and they used globus as the data upload and download system and to manage authentication and data movement. we are starting to see an awful lot of people starting to build systems like this using these services. and as i said, we're running this 2-day tutorial; you're not yet on the calendar, but it's in berkeley in a couple of weeks, then it'll be at yale,

and it'll be coming here in [inaudible] in january. okay, so i've described globus as software as a service and globus as a platform. i wanted to give one additional example of how you can leverage the services to do interesting things, and here i'm going to describe another system developed at the university of chicago, which is confusingly also called globus: it's globus genomics. are you familiar with galaxy, the galaxy workflow system? it's basically galaxy packaged up to use the globus mechanisms

to run data analyses using [inaudible] workflows on the amazon cloud. so this picture here is intended to capture this notion: data is generated, you know, it's sent to a sequencing center, it can then be shared with collaborators, it will be moved to scalable storage on amazon and computed on using cloud computing and [inaudible] galaxy, without the data ever needing to enter the researcher's lab and without the user ever having to move anything manually. it's all handled completely automatically.
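the automation just described is, at heart, a chained pipeline: transfer the data in, run an analysis workflow, share the results, with each step triggered by the previous one's completion. here is a toy driver of that shape, with plain functions standing in for the real transfer, galaxy, and sharing services (all names here are invented for illustration):

```python
# toy orchestration mimicking the globus genomics pattern:
# transfer -> workflow -> share, chained automatically.

def transfer(src, dst, files):
    # stand-in for a managed transfer from the sequencer to cloud storage
    print(f"transfer {len(files)} files: {src} -> {dst}")
    return [f"{dst}/{f}" for f in files]

def run_workflow(inputs):
    # stand-in for a galaxy-style analysis turning reads into alignments
    print(f"run workflow on {len(inputs)} inputs")
    return [p.replace(".fastq", ".bam") for p in inputs]

def share(results, with_whom):
    # stand-in for publishing results on a shared endpoint
    print(f"share {len(results)} results with {with_whom}")
    return {"shared_with": with_whom, "paths": results}

def pipeline(files):
    staged = transfer("sequencing-center", "cloud-storage", files)
    results = run_workflow(staged)
    return share(results, "collaborators")

out = pipeline(["sample1.fastq", "sample2.fastq"])
```

the value of the real system is that each arrow in this chain is a managed, retried, authenticated service call rather than a researcher moving files by hand.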

it's another nice example of how one can reduce the friction associated with what is perhaps an intellectually simple but in practice quite complex task: [inaudible] some sequencing data, analyzing it, sharing results with people and so forth. and this has been used for a variety of things. there are probably, i think, 50 groups across the us who use this service, mostly [inaudible], to do all of their sequence analysis. they don't run any software on their own computers; they just sign up for this globus genomics service, run their analyses on the amazon cloud,

and keep their data there as well. [inaudible] is the fellow who leads this project. he [inaudible] the numbers slide, so he put one together as well. he's got 30 institutions or groups. he's processed 10,000 genomes, hundreds of species, though i'm told that most of them are bacterial so it doesn't really count i guess, but there are quite a few humans as well as mice. some of these groups are pretty large; there are 75 people in one group using the service.

they've processed 2 petabytes of raw sequence data, plus the output data generated from it. okay, so i, yes, go ahead. >> ian foster: yes, globus genomics again uses some of the globus services i mentioned. what they've done is taken many of the tools built into the galaxy system, used them to build a set of standard workflows, and then packaged them up in virtual machines so they can run on the amazon computers.

>> ian foster: no, it does not. >> it doesn't have breakdown [phonetic]? >> ian foster: no. the globus platform itself is all about data movement, data sharing, data publication, not about computing. although i'm going to tell you a bit about our future directions, which will touch on some computing. >> ian foster: yeah, absolutely. so basically the data is sequenced at the [inaudible] institute,

it's transferred to amazon storage using globus, and then the globus genomics system will run galaxy workflows on the data wherever it's sitting and make the output available on a shared endpoint. >> just curious, are you being charged for this? i'm curious about this specific service of doing the analysis, because we're obviously using cloud resources. >> ian foster: so, right, this group has, you know, [inaudible] outside groups using it.

so these people will each pay for the amazon resources that they use. >> okay. >> ian foster: in a way, as you know, when you run computing on the amazon cloud or the google cloud, you pay for what you use, which is a wonderful thing if you want to provide a [inaudible] service to people, because as more people use it you don't have to pay more; they pay more. so, it's quite an attractive way of delivering computing to people if you're a small [inaudible], as the globus genomics group is.

okay, i want to now spend a few minutes talking about where we're going in the future. so, we've built this national data fabric that lets you move data between these endpoints, and we've provided apis that let you integrate them into applications, but of course there are lots of things that we either are doing now or want to do in the near future. some things are coming very soon. at the moment, when you access a globus endpoint, you do it using a specialized protocol called gridftp.

this is a very good protocol for moving things rapidly over wide area networks, but it's not so good if what you want to do is simply grab a file from a storage system, you know, to display it in the browser, for example. so historically what you see is people running these rather curious hybrid computing centers where some data sits on high performance storage systems designed for access [inaudible] gridftp, and other data, like web pages, you know, web server content,

sits on a separate set of storage systems. so we're now building http support into every globus endpoint so that you can mix these things together. you can have on your storage system the big files that you want to share and move using specialized protocols, and maybe also the small files, the thumbnails or the website content, that you want people to access directly. so we think that's going to allow for a lot of interesting new ways of organizing and distributing data. another thing we're adding, we're halfway through doing this, is support

for grouping sets of files and associating metadata with them. so, when people move or store millions of files, there's typically some logical arrangement associated with that data, and one would like to be able to represent and capture that, so that when we're moving data we know what we have and haven't moved, and when people search for data, they can find what they're searching for. and the third thing we're adding at the moment is support for data search. so, think about these 10,000 endpoints.

not everyone can access all of them, but many people can access many, and they have millions, billions of files associated with them. so we might allow people to start to discover what exists in the set of endpoints that is accessible to them. so we're doing this, and this is something we've been talking to people at nci about: also automatic metadata harvesting. so in any directory on a file system that you have access to, you'll be able to place a file, an [inaudible] metafile we call it, with some metadata in it. this could be metadata that you extracted from the files in that directory,

it could be metadata describing a collection, it could be metadata used for [inaudible]. and then we'll have global features that will [inaudible] and allow people to search across the different storage systems that you have access to and know what is where. okay, so those are some things that are coming soon; probably by the middle of next year we'll have early versions, it depends. this one, actually, we hope to have by the end of this year or middle of next year.
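one way to picture the metadata-harvesting idea: each directory may carry a small metadata file, a harvester walks the tree collecting those records into an index, and search is then a filter over the index. the file name and record shape below are invented purely for illustration:

```python
# sketch of per-directory metadata harvesting and search.
# the metafile name and record fields are hypothetical.

import json
import os
import tempfile

METAFILE = ".metadata.json"   # hypothetical per-directory metadata file

def harvest(root):
    """Walk a directory tree and collect per-directory metadata records."""
    index = []
    for dirpath, _dirs, files in os.walk(root):
        if METAFILE in files:
            with open(os.path.join(dirpath, METAFILE)) as f:
                record = json.load(f)
            record["path"] = dirpath   # remember where the record came from
            index.append(record)
    return index

def search(index, **criteria):
    """Return records matching every given key=value criterion."""
    return [r for r in index if all(r.get(k) == v for k, v in criteria.items())]

# build a tiny tree with one annotated directory and query it
root = tempfile.mkdtemp()
d = os.path.join(root, "run42")
os.makedirs(d)
with open(os.path.join(d, METAFILE), "w") as f:
    json.dump({"assay": "rna-seq", "species": "human"}, f)

index = harvest(root)
hits = search(index, species="human")
```

a real harvester would of course run across many endpoints and respect access permissions, but the walk-collect-filter shape is the core of it.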

i'm not sure of the schedule for that. some other things we want to do: add support for additional storage systems. increasingly, you know, [inaudible] are using things like box and dropbox, so we want to be able to integrate those into the system. universities are coming to us and saying users are putting all of their data into box and dropbox, and they want to be able to access them using the same protocols as we use to access other data. then you asked this question: can we run code on endpoints?

not at the moment, but we want to allow that as well, so that you can say: at this endpoint, run this software, this code, on this data, and perhaps put the result into a metadata file or return it for future use. okay, so those are things that we have on the immediate roadmap. other things we want to do, and in fact are working on with various groups like the us census, are turning this current globus system, which is a platform for data sharing and movement, into a platform for doing those things on human subjects data.

and we see this involving a couple of things. the easier part of it, maybe, is allowing for secure movement of data. that means the globus logic that people asked about, which runs on amazon computers, needs to be set up to run on fedramp-authorized services. does anyone here know what fedramp is? it's very complicated. these are the rules that govern what you can and cannot do on the cloud with sensitive data. and we've got a few other ideas that we want to get up there.

but what we are working on with the us census is perhaps yet more interesting. we want to allow people with very sensitive data, like criminal justice or census data, to upload it to what we call safe collections running on amazon cloud computers; to do safe search on that data, so you will only see the things you are allowed to see; to have access approved by data stewards associated with data collections; and maybe to analyze the data in safe workspaces, workspaces that are set

up to only run software that's been approved for that sort of analysis; and finally to have data export also controlled by these data stewards. you can imagine how this would be of interest for cancer research, especially when clinical data is being dealt with. i've actually got to the point of having some mockups of how these systems would look for the census. so this is a user.

they've got a set of projects they are working on, say recidivism in urban settings, so data about ex-offenders, and they've got various datasets that they have approval to work on. they've got one or more workspaces for operating on that data. there's a set of participants in their group, and so forth. and then this is a data steward view, you know: keeping track of pending data access requests, the number of projects being worked on, the datasets that they've provided, and the number of datasets that are actually being used by those projects.
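the safe-collection workflow described above, request, steward approval, then a safe search that shows only what you're allowed to see, can be modeled in miniature. every name in this sketch is invented for illustration; it is not the census system or a globus api.

```python
# a toy model of steward-gated access to a safe collection.
class SafeCollection:
    def __init__(self, steward, datasets):
        self.steward = steward       # callable: (user, dataset) -> bool
        self.datasets = datasets     # dataset name -> list of records
        self.pending = []            # queued (user, dataset) requests
        self.approved = set()

    def request_access(self, user, dataset):
        self.pending.append((user, dataset))

    def review(self):
        # the data steward approves or denies each pending request
        for user, dataset in self.pending:
            if self.steward(user, dataset):
                self.approved.add((user, dataset))
        self.pending.clear()

    def safe_search(self, user, predicate):
        # the user only ever sees records from datasets they hold
        # an approval for; everything else is invisible to them
        return [r for name, records in self.datasets.items()
                if (user, name) in self.approved
                for r in records if predicate(r)]
```

the essential design choice, as the talk describes it, is that the approval decision sits with a human data steward attached to the collection, not with the platform operator, and the search layer enforces whatever the steward decided.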

i emphasize these are just mockups; there's nothing behind them. but we're gathering the requirements that motivate these wireframes, so we're learning a lot about these communities. one final slide on where we want to go. at the moment, all our endpoints do is sit there and respond to requests to push or get data; that's basically what a globus endpoint does. all 10,000 endpoints sit there and receive requests for data.

people would like endpoints to be more active in some situations. they'd like, for example, to be able to set up an endpoint associated with a scientific instrument and define a rule that says what it should do when data gets created. so here's an example. this is an instrument of some sort, with a set of rules that say: if a new file is created, then run a quality control script; if quality is good, then send

an email and transfer the file to a long-term storage system. and then that storage system has its own rules: if new files are created, that is, the ones being transferred in, run some feature extraction programs; if a feature is detected, then maybe transfer the file to [inaudible] storage. we're still getting a feeling for what sorts of things people want to do, and working out how to implement these capabilities. so, suddenly one can start to think of these many endpoints not as things that sit there passively waiting for someone to do something,

but as things that can cooperate, that can participate in, say, a data curation workflow, so you can ensure that any time someone creates a new file there is some basic quality control performed and maybe some feature detection activity that allows you to catalog it. okay, so i have a few more slides. i'm going to put up this one so i don't forget: our work is supported by the nih but also some other fine agencies, as well as local support and the sloan foundation, which has been helping us
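the rule idea above, an endpoint that reacts to new files by running checks and triggering follow-on actions, can be sketched like this. the talk says the implementation was still being worked out, so this is a guess at the shape of the idea under invented names, not a real interface.

```python
# a minimal sketch of an "active" endpoint holding (condition, action)
# rules that fire whenever a new file appears on it.
class ActiveEndpoint:
    def __init__(self, name):
        self.name = name
        self.rules = []   # list of (condition, action) pairs
        self.log = []     # record of actions taken

    def on_new_file(self, condition, action):
        self.rules.append((condition, action))

    def create_file(self, filename):
        # fire every rule whose condition accepts the new file
        for condition, action in self.rules:
            if condition(filename):
                self.log.append(action(filename))

def quality_ok(filename):
    # stand-in for a real quality control script
    return filename.endswith(".dat")

instrument = ActiveEndpoint("instrument")
instrument.on_new_file(quality_ok,
                       lambda f: f"transfer {f} to long-term storage")
instrument.create_file("scan_001.dat")
instrument.create_file("scan_002.tmp")   # fails qc, so no action fires
```

chaining a second endpoint whose rules run feature extraction on transferred files gives exactly the curation workflow the talk describes: each endpoint reacts locally, and the pipeline emerges from the rules.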

with some sustainability issues. we have a wonderful team; i've just listed a few key members here. having done that, let me say a few words about what i think we've learned and some thoughts that you might take away from here. my view is that in running globus we're not just providing a data movement service; we're part of a larger movement that's building a national

or even global-scale data fabric. this data fabric comprises science dmzs and storage systems of various sorts that are being set up in different places, and what we contribute to this mix is a system that makes it very easy to add a storage system to this fabric: you basically install the globus connect software, and then anyone can engage in robust, secure, high-performance access. and then there are the cloud services, which implement the capabilities that we've been talking about in a very effective manner.

i think what's really innovative here is that we are leveraging this idea of cloud computing, software as a service and platform as a service, to deliver things that historically would have required people to install software on their own computers, which is an inherently unreliable and error-prone approach. so, a few thoughts. it is being applied in cancer research already, although we are not yet set up to deal with hipaa restrictions. it's spreading rapidly; scientists seem to like it.

people don't usually like infrastructure software, but they do seem to like this. it's pretty widely deployed across universities and labs, thanks, i would say, to initiatives from the national science foundation and department of energy to make sure that their storage systems can move data at high speeds. i don't think we see quite the same thing yet in bioinformatics or biomedicine. and we've got this interesting path, a business-model innovation if you like, rather than a technical innovation, but i think it's important: a path towards sustainability, so that we can run this long term without being dependent on research grants.

and these services are starting to be integrated into applications and [inaudible] like the [inaudible] service. so, some thoughts on what i think needs to be done to take what we've done, and shown to work in the physical and environmental sciences in particular, and really have a big impact in biomedicine. we need to integrate biomedical research facilities into the fabric. that is actually happening at nci; nci seems to be leading in this area. but there are thousands,

maybe tens of thousands, of other research institutes and hospitals and so forth that are not yet part of this fabric, and that is a problem for us. nih could certainly help by subscribing to these services, to help sustain this fabric, which it will increasingly be depending on. we need to address issues of hipaa compliance so we can deal with phi. and then there's what we're trying to do with this upcoming workshop and many others.

the aim of such workshops is to cultivate an ecosystem where people are building data portals and applications. so this is a fabric: it lets you do simple things like move data, but if you want to deliver something specialized, like a collection of imaging data, you can integrate these capabilities into a specialized data portal. and if you do that, then that portal can interoperate with other data sources. and we're going to keep adding capabilities, as i've discussed.

so, thank you very much for your attention. [ applause ] >> dr. foster, i have a couple of questions on the webex. >> ian foster: okay. [inaudible]. >> so the first one says: can globus online support third-party transfer, i.e., initiate movement of data from one endpoint to another endpoint without having to host the globus online server? >> ian foster: yes. in my little example earlier,

the researcher moving data from the sequencing center to the compute facility can simply connect to the website and use that. you only need to host some software if you want to move data to your own desktop or workstation. >> okay. and one more, from a different person. i have a question about data stewards. will they be system admins from the globus team or from the data owner's team? >> ian foster: neither. the data steward in the earlier discussion that we had,

i didn't show this slide, is a person from the us census, or the illinois department of corrections, or the national cancer institute i guess, who is responsible for defining policies and approving requests to access data. yeah? >> ian foster: no, i agree. sometimes you need to move data to combine it with something else, but often you would be better off computing where the data is located.

so, i think people are always going to be moving data for many reasons, but we believe strongly in making it easy for people to build rich data services that integrate data analysis capabilities, and that's what we're leaning towards with these web services ideas. but we move [inaudible] deliberately, because we want to make the system that we build very high quality and very easy to use, so that people make use of it. there's a question over there.

yes, that's right. we're working right now on understanding fedramp compliance, and i think we're at the point where we understand reasonably well what will be required and how much it will cost. we don't yet have the money to do it. on the internet [inaudible] issues i really know very little. dr. godley tells me that her team has a few lawyers involved, so i guess we can consult with them.

>> so a couple more questions from webex. one is: are subscriptions for services based on a fee-per-service model, meaning the amount of time or resources used? >> ian foster: good question. i introduced some confusion by talking about globus and globus genomics. the globus transfer service is typically an institutional subscription, and there's no usage component to it. we initially tried usage-based subscriptions, and then we realized

that we were disincentivizing people from moving data, which was not the intent, so now we base it on the size of the institution. i don't actually know the exact numbers, but they are not very big; a typical university might be paying $20,000 a year, though don't quote me on that because i don't do the business side. it's not a lot of money. now, globus genomics, which is a completely separate organization, has a subscription-based model for the people who use its services,

and there i think it's a mix of paying for the compute time you use and maybe a per-genome [inaudible]. i'm not sure how they do it. >> okay. one more, if you don't mind. are the apis accessible if [inaudible] installs and maintains their own globus online server, as opposed to the globus online service that is hosted on aws? >> ian foster: we don't make the software that is hosted on aws available for [inaudible], and that's a very deliberate choice, because we believe strongly

that the way to achieve sustainability is to have a critical mass of people using it. sorry, there was a question over there. i meant to spend some time on that, and then i had too much to cover, so i didn't. so i'm glad you asked the question. my understanding is that the commons is a state of mind, a philosophy, right? where people share data and where friction

that might impede data sharing is eliminated, whether the friction is what we address, which is maybe enabling people to share their data when they want to and track who is sharing it, or an issue of, you know, standards for how you encode data, or common tools for analyzing data. so i think we provide a very important element of this, but certainly not the only element: the fabric over which data sharing occurs. >> so, when you said subscription, is the subscription for one data center?

or is it, the question i have is, let's suppose that an institution, we have [inaudible]. >> ian foster: again, don't quote me, but my understanding is that the nih subscription probably covers any nih intramural program. >> thank you. >> ian foster: i think there are about 6 at the moment. you probably don't know about them all. >> i know of 4.

so, yes, anyone can set one up, and if you do that before the january workshop, you'll be ready to ask good questions. thanks very much for the great questions and for your attention.
