README.md

    Build multimodal AI applications with cloud-native technologies

    Jina lets you build multimodal AI services and pipelines that communicate via gRPC, HTTP and WebSockets, then scale them up and deploy to production. You can focus on your logic and algorithms, without worrying about the infrastructure complexity.

    Jina provides a smooth Pythonic experience for serving ML models, transitioning from local deployment to advanced orchestration frameworks like Docker Compose, Kubernetes, or Jina AI Cloud. Jina makes advanced solution engineering and cloud-native technologies accessible to every developer.

    Wait, how is Jina different from FastAPI? Jina's value proposition may seem quite similar to that of FastAPI. However, there are several fundamental differences:

    Data structure and communication protocols

    • FastAPI communication relies on Pydantic, while Jina relies on DocArray, allowing Jina to support multiple protocols for exposing its services. Support for the gRPC protocol is especially useful for data-intensive applications, such as embedding services, where embeddings and tensors can be serialized more efficiently.

    Advanced orchestration and scaling capabilities

    • Jina allows you to easily containerize and orchestrate your services and models, providing concurrency and scalability.
    • Jina lets you deploy applications formed from multiple microservices that can be containerized and scaled independently.

    Journey to the cloud

    • Jina provides a smooth transition from local development (using DocArray), to local serving (using Deployment and Flow), to production-ready services that use Kubernetes' capacity to orchestrate the lifetime of containers.
    • Using Jina AI Cloud, you get access to scalable and serverless deployments of your applications with one command.

    Documentation

    Install

    pip install jina
    

    Find more install options for Apple Silicon and Windows.

    Get Started

    Basic Concepts

    Jina has three fundamental layers:

    • Data layer: BaseDoc and DocList (from DocArray) are the input/output formats in Jina.
    • Serving layer: An Executor is a Python class that transforms and processes Documents. By simply wrapping your models into an Executor, you allow them to be served and scaled by Jina. The Gateway is the service that connects all Executors inside a Flow.
    • Orchestration layer: Deployment serves a single Executor, while a Flow serves Executors chained into a pipeline.

    The full glossary is explained here.
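
    To make these layers concrete, here is a minimal sketch (MyDoc and MyExecutor are illustrative names, not part of Jina) that defines a schema, wraps some logic in an Executor, and serves it with a Deployment:

    from jina import Deployment, Executor, requests
    from docarray import BaseDoc, DocList


    # Data layer: define the input/output schema
    class MyDoc(BaseDoc):
        text: str = ''


    # Serving layer: wrap your logic or model in an Executor
    class MyExecutor(Executor):
        @requests
        def uppercase(self, docs: DocList[MyDoc], **kwargs) -> DocList[MyDoc]:
            for doc in docs:
                doc.text = doc.text.upper()
            return docs


    # Orchestration layer: serve the Executor with a Deployment
    dep = Deployment(uses=MyExecutor, port=12345)

    with dep:
        dep.block()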

    Serve AI models

    Let’s build a fast, reliable and scalable gRPC-based AI service. In Jina we call this an Executor. Our simple Executor will wrap the StableLM LLM from Stability AI. We’ll then use a Deployment to serve it.

    Note A Deployment serves just one Executor. To combine multiple Executors into a pipeline and serve that, use a Flow.

    Let’s implement the service’s logic:

    executor.py
    from jina import Executor, requests
    from docarray import DocList, BaseDoc
    
    from transformers import pipeline
    
    
    class Prompt(BaseDoc):
        text: str
    
    
    class Generation(BaseDoc):
        prompt: str
        text: str
    
    
    class StableLM(Executor):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self.generator = pipeline(
                'text-generation', model='stabilityai/stablelm-base-alpha-3b'
            )
    
        @requests
        def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
            generations = DocList[Generation]()
            prompts = docs.text
            llm_outputs = self.generator(prompts)
            for prompt, output in zip(prompts, llm_outputs):
                # each pipeline output is a list of dicts with a 'generated_text' key
                generations.append(
                    Generation(prompt=prompt, text=output[0]['generated_text'])
                )
            return generations
    

    Then we deploy it with either the Python API or YAML:

    Python API: deployment.py
    from jina import Deployment
    from executor import StableLM
    
    dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)
    
    with dep:
        dep.block()
    
    YAML: deployment.yml
    jtype: Deployment
    with:
      uses: StableLM
      py_modules:
        - executor.py
      timeout_ready: -1
      port: 12345
    

    And run the YAML Deployment with the CLI: jina deployment --uses deployment.yml

    Use Jina Client to make requests to the service:

    from jina import Client
    from docarray import DocList, BaseDoc
    
    
    class Prompt(BaseDoc):
        text: str
    
    
    class Generation(BaseDoc):
        prompt: str
        text: str
    
    
    prompt = Prompt(
        text='suggest an interesting image generation prompt for a mona lisa variant'
    )
    
    client = Client(port=12345)  # use port from output above
    response = client.post(on='/', inputs=[prompt], return_type=DocList[Generation])
    
    print(response[0].text)
    
    a steampunk version of the Mona Lisa, incorporating mechanical gears, brass elements, and Victorian era clothing details
    

    Note In a notebook, you can’t use dep.block() and then make requests with the client. Please refer to the Colab link above for reproducible Jupyter Notebook code snippets.

    Build a pipeline

    Sometimes you want to chain microservices together into a pipeline. That’s where a Flow comes in.

    A Flow is a DAG pipeline composed of a set of steps. It orchestrates a set of Executors and a Gateway to offer an end-to-end service.

    Note If you just want to serve a single Executor, you can use a Deployment.

    For instance, let’s combine our StableLM language model with a Stable Diffusion image generation model. Chaining these services together into a Flow will give us a service that will generate images based on a prompt generated by the LLM.

    text_to_image.py
    import numpy as np
    from jina import Executor, requests
    from docarray import BaseDoc, DocList
    from docarray.documents import ImageDoc
    
    
    class Generation(BaseDoc):
        prompt: str
        text: str
    
    
    class TextToImage(Executor):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            from diffusers import StableDiffusionPipeline
            import torch
    
            self.pipe = StableDiffusionPipeline.from_pretrained(
                "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
            ).to("cuda")
    
        @requests
        def generate_image(self, docs: DocList[Generation], **kwargs) -> DocList[ImageDoc]:
            result = DocList[ImageDoc]()
            images = self.pipe(
                docs.text
            ).images  # image here is in [PIL format](https://pillow.readthedocs.io/en/stable/)
            # wrap each generated PIL image into an ImageDoc tensor
            for image in images:
                result.append(ImageDoc(tensor=np.array(image)))
            return result
    

    Build the Flow with either Python or YAML:

    Python API: flow.py
    from jina import Flow
    from executor import StableLM
    from text_to_image import TextToImage
    
    flow = (
        Flow(port=12345)
        .add(uses=StableLM, timeout_ready=-1)
        .add(uses=TextToImage, timeout_ready=-1)
    )
    
    with flow:
        flow.block()
    
    YAML: flow.yml
    jtype: Flow
    with:
        port: 12345
    executors:
      - uses: StableLM
        timeout_ready: -1
        py_modules:
          - executor.py
      - uses: TextToImage
        timeout_ready: -1
        py_modules:
          - text_to_image.py
    

    Then run the YAML Flow with the CLI: jina flow --uses flow.yml

    Then, use Jina Client to make requests to the Flow:

    from jina import Client
    from docarray import DocList, BaseDoc
    from docarray.documents import ImageDoc
    
    
    class Prompt(BaseDoc):
        text: str
    
    
    prompt = Prompt(
        text='suggest an interesting image generation prompt for a mona lisa variant'
    )
    
    client = Client(port=12345)  # use port from output above
    response = client.post(on='/', inputs=[prompt], return_type=DocList[ImageDoc])
    
    response[0].display()
    

    Easy scalability and concurrency

    Why not just use standard Python to build that service and pipeline? Jina accelerates time to market of your application by making it more scalable and cloud-native. Jina also handles the infrastructure complexity in production and other Day-2 operations so that you can focus on the data application itself.

    Increase your application’s throughput with scalability features out of the box, like replicas, shards and dynamic batching.

    Let’s scale a Stable Diffusion Executor deployment with replicas and dynamic batching:

    • Create two replicas, with a GPU assigned for each.
    • Enable dynamic batching so that parallel incoming requests are processed together in the same model inference call.
    Normal Deployment
    jtype: Deployment
    with:
      uses: TextToImage
      timeout_ready: -1
      py_modules:
        - text_to_image.py
    
    Scaled Deployment
    jtype: Deployment
    with:
      uses: TextToImage
      timeout_ready: -1
      py_modules:
        - text_to_image.py
      env:
       CUDA_VISIBLE_DEVICES: RR
      replicas: 2
      uses_dynamic_batching: # configure dynamic batching
        /default:
          preferred_batch_size: 10
          timeout: 200
    

    Assuming your machine has two GPUs, using the scaled deployment YAML will give better throughput compared to the normal deployment.

    These features apply to both Deployment YAML and Flow YAML. Thanks to the YAML syntax, you can inject deployment configurations regardless of Executor code.
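
    The same scaling options can also be passed through the Python API. A rough sketch, assuming two local GPUs and the TextToImage Executor from above (the parameter values mirror the scaled YAML example):

    from jina import Deployment
    from text_to_image import TextToImage

    dep = Deployment(
        uses=TextToImage,
        timeout_ready=-1,
        env={'CUDA_VISIBLE_DEVICES': 'RR'},  # round-robin GPU assignment across replicas
        replicas=2,
        uses_dynamic_batching={'/default': {'preferred_batch_size': 10, 'timeout': 200}},
    )

    with dep:
        dep.block()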

    Deploy to the cloud

    Containerize your Executor

    In order to deploy your solutions to the cloud, you need to containerize your services. Jina provides the Executor Hub, a tool that streamlines this process and takes much of the hassle off your hands. It also lets you share these Executors publicly or privately.

    You just need to structure your Executor in a folder:

    TextToImage/
    ├── executor.py
    ├── config.yml
    ├── requirements.txt
    
    config.yml
    jtype: TextToImage
    py_modules:
      - executor.py
    metas:
      name: TextToImage
      description: Text to Image generation Executor based on StableDiffusion
      url:
      keywords: []
    
    requirements.txt
    diffusers
    accelerate
    transformers
    

    Then push the Executor to the Hub by doing: jina hub push TextToImage.

    This will give you a URL that you can use in your Deployment and Flow to reference the pushed Executors’ containers.

    jtype: Flow
    with:
        port: 12345
    executors:
      - uses: jinai+docker://<user-id>/StableLM
      - uses: jinai+docker://<user-id>/TextToImage
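
    The same Hub reference also works with the Python API. For example, a sketch that serves just the pushed TextToImage container (replace <user-id> with your own Executor Hub user ID):

    from jina import Deployment

    # pull and run the containerized Executor from Executor Hub
    dep = Deployment(uses='jinai+docker://<user-id>/TextToImage', timeout_ready=-1, port=12345)

    with dep:
        dep.block()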
    

    Get on the fast lane to cloud-native

    Using Kubernetes with Jina is easy:

    jina export kubernetes flow.yml ./my-k8s
    kubectl apply -R -f my-k8s
    

    And so is Docker Compose:

    jina export docker-compose flow.yml docker-compose.yml
    docker-compose up
    

    Note You can also export Deployment YAML to Kubernetes and Docker Compose.

    That’s not all. We also support OpenTelemetry, Prometheus, and Jaeger.
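
    For instance, tracing and metrics can be switched on directly from the orchestration layer. A sketch, assuming an OpenTelemetry collector is reachable at the placeholder host and port below (check the observability docs for your Jina version for the exact parameter names):

    from jina import Deployment
    from text_to_image import TextToImage

    dep = Deployment(
        uses=TextToImage,
        timeout_ready=-1,
        tracing=True,  # export OpenTelemetry traces
        traces_exporter_host='http://localhost',
        traces_exporter_port=4317,
        metrics=True,  # export metrics, e.g. scraped by Prometheus via the collector
        metrics_exporter_host='http://localhost',
        metrics_exporter_port=4317,
    )

    with dep:
        dep.block()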

    What cloud-native technology is still challenging to you? Tell us and we’ll handle the complexity and make it easy for you.

    Deploy to JCloud

    You can also deploy a Flow to JCloud, where you can easily enjoy autoscaling, monitoring and more with a single command.

    First, turn the flow.yml file into a JCloud-compatible YAML by specifying resource requirements and using containerized Hub Executors.

    Then, use the jina cloud deploy command to deploy to the cloud:

    wget https://raw.githubusercontent.com/jina-ai/jina/master/.github/getting-started/jcloud-flow.yml
    jina cloud deploy jcloud-flow.yml
    

    Warning

    Make sure to delete/clean up the Flow once you are done with this tutorial to save resources and credits.

    Read more about deploying Flows to JCloud.

    Streaming for LLMs

    Large Language Models can power a wide range of applications, from chatbots to assistants and intelligent systems. However, these models can be heavy and slow, and your users want systems that are both intelligent _and_ fast!

    Large language models work by turning your questions into tokens and then generating new tokens one at a time until they decide that generation should stop. This means you want to stream the output tokens generated by a large language model to the client. In this tutorial, we will discuss how to achieve this with Streaming Endpoints in Jina.

    Service Schemas

    The first step is to define the streaming service schemas, as you would do in any other service framework. The input to the service is the prompt and the maximum number of tokens to generate, while the output is the latest token ID together with the generated text so far:
    from docarray import BaseDoc
    
    
    class PromptDocument(BaseDoc):
        prompt: str
        max_tokens: int
    
    
    class ModelOutputDocument(BaseDoc):
        token_id: int
        generated_text: str
    

    Service initialization

    Our service depends on a large language model. As an example, we will use the `gpt2` model. This is how you would load such a model in your Executor:
    from jina import Executor, requests
    from transformers import GPT2Tokenizer, GPT2LMHeadModel
    import torch
    
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    
    
    class TokenStreamingExecutor(Executor):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self.model = GPT2LMHeadModel.from_pretrained('gpt2')
    

    Implement the streaming endpoint

    Our streaming endpoint accepts a `PromptDocument` as input and streams `ModelOutputDocument`s. To stream a document back to the client, use the `yield` keyword in the endpoint implementation. Therefore, we use the model to generate up to `max_tokens` tokens and yield them until the generation stops:
    class TokenStreamingExecutor(Executor):
        ...
    
        @requests(on='/stream')
        async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
            input = tokenizer(doc.prompt, return_tensors='pt')
            input_len = input['input_ids'].shape[1]
            for _ in range(doc.max_tokens):
                output = self.model.generate(**input, max_new_tokens=1)
                if output[0][-1] == tokenizer.eos_token_id:
                    break
                yield ModelOutputDocument(
                    token_id=output[0][-1],
                    generated_text=tokenizer.decode(
                        output[0][input_len:], skip_special_tokens=True
                    ),
                )
                input = {
                    'input_ids': output,
                    'attention_mask': torch.ones(1, len(output[0])),
                }
    

    Learn more about streaming endpoints from the Executor documentation.

    Serve and send requests

    The final step is to serve the Executor and send requests using the client. To serve the Executor using gRPC:

    from jina import Deployment
    
    with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
        dep.block()
    

    To send requests from a client:

    import asyncio
    from jina import Client
    
    
    async def main():
        client = Client(port=12345, protocol='grpc', asyncio=True)
        async for doc in client.stream_doc(
            on='/stream',
            inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
            return_type=ModelOutputDocument,
        ):
            print(doc.generated_text)
    
    
    asyncio.run(main())
    
    The
    The capital
    The capital of
    The capital of France
    The capital of France is
    The capital of France is Paris
    The capital of France is Paris.
    

    Support

    Join Us

    Jina is backed by Jina AI and licensed under Apache-2.0.
