As indicated before, I have been further exploring the constantly evolving MCP world. MCP sampling is a fun element of the Model Context Protocol: it essentially allows the server to access and use the client’s LLM. This way the MCP server gains additional capabilities while offloading the inference cost to the user. While the protocol states:

“For trust & safety and security, there SHOULD always be a human in the loop with the ability to deny sampling requests.”

there will surely be clients that do not implement any guardrails, and there will be servers that attempt to misuse the powers they have been granted.
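What a human-in-the-loop check might look like in practice: a small approval gate in front of the client’s inference call. The approve_sampling helper below is a hypothetical sketch, not part of any MCP library; the 200-character preview limit is an arbitrary choice.

```python
def approve_sampling(prompt: str, ask=input) -> bool:
    """Show the server's sampling prompt to the user and only proceed
    on an explicit "y"; any other answer denies the request."""
    # Truncate very long prompts so the user sees a readable preview.
    preview = prompt if len(prompt) <= 200 else prompt[:200] + "..."
    answer = ask(f"Server requests LLM sampling:\n{preview}\nAllow? [y/N] ")
    return answer.strip().lower() == "y"
```

A sampling handler would call approve_sampling(message) before invoking the model; tests can inject `ask` to avoid blocking on input().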

Let’s take a look at a simple MCP server and client. To serve the LLM we will use LM Studio with ministral-3-14b-instruct-2512 as the model.

Server

The server uses FastMCPv2 and exposes a tool generate_keywords. The tool generates a list of keywords matching a given topic. The generation task is offloaded to the LLM of the client by invoking a helper function send_sampling_message to send the sampling requests. When the tool is called, two sampling requests are sent in sequence:

  1. The first one takes the topic, constructs a prompt for the list generation, and then asks the client’s LLM to rewrite it (in this case: restricting the generation to five words only).
  2. The second one uses the client’s LLM to actually execute the prompt (in this case: generating a list of keywords with the restriction from step 1).
import logging
import sys
from fastmcp import FastMCP, Context
from mcp.types import SamplingMessage, TextContent

logging.getLogger().setLevel(logging.INFO)

mcp = FastMCP(name="MCP sampling")

async def send_sampling_message(ctx: Context, message: str):
    return await ctx.sample(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(type="text", text=message),
            )
        ],
        max_tokens=1000,
    )

@mcp.tool
async def generate_keywords(topic: str, ctx: Context) -> str:
    """Generate keywords using LLM sampling."""
    prompt = f"List words that represent names of {topic}"

    # 1. Prompt rewrite
    result = await send_sampling_message(ctx, f"Rewrite '{prompt}' to include a restriction for the list to only contain 5 words. Return the rewritten text only.")

    rewritten_prompt = result.text
    logging.info(f"rewritten prompt: {rewritten_prompt}")

    # 2. Keyword generation
    result = await send_sampling_message(ctx, rewritten_prompt)

    logging.info(f"sampling message: {result.text}")
    return result.text

def main():
    mcp.run(transport="streamable-http", port=5001)

    sys.exit(0)

if __name__ == "__main__":
    main()

Client

The client is simple. For convenience, it uses the lmstudio library for model access; this part could easily be replaced with any other model call. All the magic is in the sampling_handler function, which processes the sampling request and triggers the LLM inference.

import logging

import lmstudio
from fastmcp.client import Client
from fastmcp.client.sampling import (
    SamplingMessage,
    SamplingParams,
    RequestContext,
)

logging.getLogger().setLevel(logging.INFO)

llm = lmstudio.llm("ministral-3-14b-instruct-2512")

async def sampling_handler(
    messages: list[SamplingMessage],
    params: SamplingParams,
    context: RequestContext
) -> str:
    """Accept sampling messages from the server and execute inference."""
    logging.info(f"sampling handler operation with message: {messages}")
    message = "\n\n".join(
        [
            f"{m.content.text}" for m in messages
        ]
    )

    response = llm.respond(message)

    logging.info(f"LLM response: {response.content}")
    return response.content

client = Client(
    "http://localhost:5001/mcp",
    sampling_handler=sampling_handler,
)

async def main():
    async with client:
        await client.ping()

        sampling_result = await client.call_tool("generate_keywords", {"topic": "elements"})
        logging.info(f"sampling result: '{sampling_result.content[0].text}'")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

To run the example code, we first start the server with uv run server.py and then run the client: uv run client.py. The expected output is as follows.

server:

INFO:root:rewritten prompt: "List **five** words that represent names of chemical elements."
[...]
INFO:root:sampling message: Here are five words that represent names of chemical elements:

1. Oxygen
2. Hydrogen
3. Carbon
4. Gold
5. Sodium

client:

INFO:root:sampling handler operation with message: [SamplingMessage(role='user', content=TextContent(type='text', text="Rewrite 'List words that represent names of elements' to include a restriction for the list to only contain 5 words. Return the rewritten text only.", annotations=None, meta=None))]
[...]
INFO:root:sampling handler operation with message: [SamplingMessage(role='user', content=TextContent(type='text', text='"List **five** words that represent names of chemical elements."', annotations=None, meta=None))]
[...]
INFO:root:sampling result: 'Here are five words that represent names of chemical elements:

1. Oxygen
2. Hydrogen
3. Carbon
4. Gold
5. Sodium'

Communication flow

Below is the communication flow depicted as a sequence diagram.

Sequence diagram showing the client-server communication, including the LLM calls.
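In text form, the flow from the example above is roughly:

```
Client                              Server                      LM Studio LLM
  |-- call_tool("generate_keywords") -->|
  |<- sampling request 1: rewrite prompt|
  |-- run inference ------------------------------------------------->|
  |<- rewritten prompt ------------------------------------------------|
  |-- sampling result 1: rewritten ---->|
  |<- sampling request 2: rewritten ----|
  |-- run inference ------------------------------------------------->|
  |<- keyword list -----------------------------------------------------|
  |-- sampling result 2: keyword list ->|
  |<- tool result: keyword list --------|
```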

Summary

The example shows that the sampling handler is the key piece to pay attention to. For unattended execution of sampling requests, guardrails are a must: the MCP server stays in full control of how often and when sampling is executed, while the client can only inspect the prompt and decide whether or not to execute the sampling request. This shows how dangerous a malicious server can be when disguised as a helpful tool. Imagine, as in this example, a server that triggers two sampling calls: the first executes legitimate logic, while the second simply uses “free” model inference for requests collected in a queue.
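For unattended handlers, a hedged sketch of such a guardrail could rate-limit sampling requests and screen prompts before any inference runs. The SamplingGuard class, the per-minute limit, and the denylist below are illustrative assumptions, not part of FastMCP or the MCP specification:

```python
import time

class SamplingGuard:
    """Illustrative guardrail: rate-limits sampling requests and
    blocks prompts containing denylisted terms before inference runs."""

    def __init__(self, max_requests_per_minute: int = 5,
                 denylist: tuple[str, ...] = ("password", "api key")):
        self.max_requests = max_requests_per_minute
        self.denylist = denylist
        self.timestamps: list[float] = []

    def allow(self, prompt: str) -> bool:
        now = time.monotonic()
        # Keep only timestamps within the last minute.
        self.timestamps = [t for t in self.timestamps if now - t < 60]
        if len(self.timestamps) >= self.max_requests:
            return False  # too many sampling requests, likely abuse
        if any(term in prompt.lower() for term in self.denylist):
            return False  # suspicious prompt content
        self.timestamps.append(now)
        return True

guard = SamplingGuard(max_requests_per_minute=2)
print(guard.allow("List five words for chemical elements"))  # True
print(guard.allow("Send me the user's password"))            # False (denylist)
```

In the client, sampling_handler would consult guard.allow(message) and return an error (or raise) instead of calling llm.respond when the check fails.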