18 Installing and Using Ollama on Windows and macOS
18.1 Introduction: Local Large Language Models and Ollama
Local large language models (LLMs) have emerged as powerful tools that run entirely on personal hardware, offering benefits like enhanced data privacy and offline capability[1]. By keeping sensitive data on local machines and not relying on cloud services, researchers can experiment with AI models securely and without internet access[1]. In recent years, an ecosystem of user-friendly frameworks has grown to support local LLM deployment. Notable examples include Ollama, Foundry Local, and Docker’s Model Runner[2]. These frameworks make it easier to download and run advanced language models on commodity hardware. Among them, Ollama has gained prominence as one of the easiest ways to get up and running with large language models such as GPT‑OSS, Gemma 3, DeepSeek‑R1, Qwen3, and more[3].
Ollama is an open-source tool and runtime that allows users to run open-source LLMs locally using simple commands, without the need for cloud APIs or complex setup[4]. In essence, Ollama provides a unified platform to download, manage, and interact with a variety of LLMs on your own machine[4]. It supports models for general text generation, coding assistance, multimodal (vision-language) tasks, and even embeddings for retrieval. Under the hood, Ollama runs a local server that hosts these models and exposes an API, enabling both command-line and programmatic interactions. Researchers and practitioners can prompt the models directly via a CLI or integrate them into applications through standard API calls, achieving functionality similar to cloud-based AI services but entirely on local infrastructure[5]. This chapter provides step-by-step guidance on installing Ollama on Windows and macOS, explains how Ollama facilitates local LLM usage, surveys some high-performing models available through Ollama (with a focus on coding and retrieval-augmented generation tasks), and presents practical examples of using these models. We also discuss how to integrate Ollama-hosted models into a Retrieval-Augmented Generation (RAG) pipeline with frameworks like LangChain or LlamaIndex, and we describe the Ollama API for programmatic access. The writing is aimed at interdisciplinary researchers in digital and data science fields, presented in an academic style with references to documentation and relevant literature.
Before diving into installation, it is useful to clarify what we mean by retrieval-augmented generation (RAG), as this is a recurring theme. RAG is a technique where a language model is combined with an external knowledge repository or database to enhance its factual accuracy and up-to-date knowledge[6]. In a RAG pipeline, the model retrieves relevant documents (often via vector similarity search) and uses their content as context when generating answers[6]. This approach has been shown to improve performance on knowledge-intensive tasks by grounding the model’s outputs in external information[6]. Ollama can play a critical role in RAG setups by providing both the embedding models for document retrieval and the generative models for answering questions, all running locally on a researcher’s computer. With this context in mind, we will start with how to install Ollama on the two major operating systems, and then explore its capabilities in detail.
18.2 Installing Ollama on Windows and macOS
Installing Ollama is straightforward on both Windows and macOS, with pre-built installers provided for each platform. In this section, we outline system requirements and guide you through the installation steps on Windows and then on macOS. By the end of this section, you will have the Ollama application installed and running, ready to download and serve local LLMs.
18.3 Installation on Windows
System Requirements: To run Ollama on Windows, you should have Windows 10 (version 22H2 or newer) or Windows 11, either Home or Pro edition[7]. Ollama supports both NVIDIA and AMD GPUs on Windows, so if you plan to use GPU acceleration, ensure you have a compatible driver: NVIDIA drivers version 452.39 or newer, or the latest AMD Radeon drivers for your card[7]. Older GPUs without the required CUDA support may fall back to CPU execution. While Ollama does not require a GPU (it can run on CPU), using a modern GPU with sufficient VRAM is recommended for larger models and better performance. Keep in mind that running large models also demands significant RAM; for example, at least 8 GB RAM is recommended for 7B-parameter models, 16 GB for 13B models, and 32 GB for 30B+ models[8].
18.4 Installation Steps:
Download the Installer: Visit the official Ollama website and navigate to the Download page. Download the Windows installer (usually provided as OllamaSetup.exe) for the latest version of Ollama. This installer bundles the Ollama application and required dependencies. No administrative privileges are required – by default, it will install to your user’s home directory[9].
Run the Installer: Locate the downloaded OllamaSetup.exe and double-click it. Follow the prompts in the setup wizard. The installer will copy the necessary files into your home directory (under %LOCALAPPDATA% by default) and register the Ollama application. It will also add the ollama command-line tool to your PATH, so that it can be accessed from Command Prompt, PowerShell, or any terminal[5][10]. Because the installation is per-user, you won't be prompted for Administrator approval.
Start Ollama (if not started automatically): After installation, Ollama will run a background service automatically. In Windows, Ollama typically adds a system tray icon and launches its background server on start-up[5]. The Ollama service exposes a local API at http://localhost:11434 and listens for model requests[5]. If the installer finishes without error, the service should be running. You can verify this by opening a terminal and running a quick command such as:
ollama list
If the background service is running, this command contacts the server and prints the table of locally installed models (initially empty). If it cannot reach the server, start Ollama via the Start menu (find Ollama and launch it) or run ollama serve in PowerShell to launch the server manually.
Disk Space Considerations: The base Ollama installation requires roughly 4 GB of disk space for binaries and core files[9]. However, this does not include the space for models. Each downloaded model can range from a few gigabytes to tens or even hundreds of gigabytes for the largest models[11]. By default, models are stored in your user profile directory (e.g., %HOMEPATH%\.ollama) along with configuration. Ensure you have adequate storage in this location, or consider changing it (next step) if not.
(Optional) Change Installation or Model Storage Location: If you need to install the Ollama program files to a different drive (for example, a larger secondary drive), you can specify a custom directory when running the installer. From a command line, run the installer with the /DIR="D:\some\location" flag to choose an alternate installation directory[12]. Likewise, to store downloaded model files elsewhere (e.g., on an external drive or another folder with more space), set the environment variable OLLAMA_MODELS to your desired path before running Ollama[13]. You can set this in Windows by searching “Edit environment variables for your account” and adding a new user variable named OLLAMA_MODELS with the target directory path[14]. After changing this, restart the Ollama service (exit the tray app and re-launch it, or reboot) to have it pick up the new model location[15]. These steps ensure large model files won't fill up your main drive.
Finish Installation: Once installed and configured, you can test Ollama by pulling a small model and running it. For example, open PowerShell and execute:
ollama pull gemma3:1b   # download a 1B-parameter model (Gemma 3 variant)
ollama run gemma3:1b    # run the model with an interactive prompt
The above commands will download the Gemma 3 model (1-billion parameter version, ~800 MB) and then run it so you can enter queries. You should see Ollama begin to download model weights if not already present, then eventually provide a prompt where you can type a question. This verifies that Ollama is functioning correctly on your Windows system. On first run, the download can take some time; subsequently, the model is cached locally for reuse. By default, Ollama’s Windows service continues running in the background, so you can invoke ollama commands anytime without manually restarting the server[5]. If needed, you can quit the Ollama background application via the system tray or using the Task Manager, and uninstall it through Add or Remove Programs in Settings (Ollama’s installer registers an uninstaller for easy removal)[16].
Note: Advanced users who wish to run Ollama as a standalone CLI without the GUI/tray can download a ZIP archive containing the Ollama CLI and required libraries (for instance ollama-windows-amd64.zip from the releases). Extracting this will give you the ollama binary, which can be run directly or as a system service (e.g., via NSSM) using the ollama serve command[17]. However, for most users the standard installer is recommended, as it includes an auto-update mechanism to keep Ollama up-to-date with the latest improvements and model support[18].
Installation on macOS
System Requirements: Ollama supports macOS Sonoma (v14) or later[19]. It works on both Apple Silicon (M1/M2-series chips) and Intel x86_64 Macs, although performance will be significantly better on Apple Silicon. On Apple Silicon Macs, Ollama can utilize the integrated GPU (via Metal) for acceleration, whereas on Intel Macs it will run using the CPU only[20]. Ensure you have at least a few gigabytes of free disk space for the installation and additional space for model files (which, as on Windows, can be tens of GB)[21].
Installation Steps:
Download the Installer: From the Ollama website, download the macOS installer, which is provided as a disk image file (.dmg). Make sure to grab the latest release compatible with your macOS version.
Install the Application: Open the downloaded .dmg file. This will mount a virtual disk and open a Finder window. Inside, you should see the Ollama application icon. Install Ollama by dragging and dropping the Ollama app into your system's Applications folder[22]. This copies the app to your machine. Once copied, you can eject the disk image.
First Launch and CLI Setup: Navigate to your Applications folder and launch Ollama.app. On first run, you may be prompted with a dialog asking if you want to move the app to the Applications folder (if you launched it from elsewhere) – since we already placed it in Applications, you can confirm or ignore as appropriate. The Ollama app will then verify whether the command-line tool ollama is accessible via your shell’s PATH. If it is not found, Ollama will prompt for permission to create a symlink to its internal CLI binary in /usr/local/bin (which is a standard location for user binaries on macOS)[21]. Granting permission will allow you to use the ollama command in Terminal without specifying the full path. After this one-time setup, the ollama CLI becomes available system-wide.
Background Service: Similar to Windows, the macOS installation runs a background server. When you open the Ollama app, it will typically start a menubar icon (look for an Ollama logo in the macOS menu bar) and begin running the local model server in the background. The server listens at http://localhost:11434 for API requests, just as on Windows. You do not need to manually start it each time; launching the app once (or logging in, if you set it to start on login) is sufficient. The CLI commands you run in Terminal will communicate with this background service. If needed, you can stop the service by quitting the Ollama app (for example, via the menubar icon’s menu or ⌘+Q if the app has a Dock icon) – though generally you’ll keep it running while working with models.
Disk Space and Model Storage: By default, Ollama on macOS will store model files and configuration in your home directory under ~/.ollama[23]. As noted, models can be large (several GB each), so ensure your Mac has sufficient storage in the home volume. If your home directory is low on space, you have a couple of options. One is to change the model storage directory by setting the OLLAMA_MODELS environment variable to a folder on a drive with more space (e.g., an external drive or secondary volume). You would do this by editing your shell profile (like ~/.zshrc or ~/.bash_profile) to export OLLAMA_MODELS="/Path/To/ModelStorage" and then relaunching the Ollama app so it picks up the new location. Another option, if you prefer to keep the entire application elsewhere, is to install Ollama in a custom location: you can drag the Ollama app to a different folder instead of Applications, but you must then ensure the CLI symlink points correctly to Ollama.app/Contents/Resources/ollama in that location[24]. On first launch, if you chose a non-standard install location, decline any prompt to “Move to Applications?” to keep it where it is[25]. The key is that the ollama binary (inside the app package) needs to be on the PATH. Advanced users can manage this manually (e.g., by adding a symbolic link in /usr/local/bin themselves).
Completing Installation: With the app running and the CLI configured, you can now open Terminal (or iTerm, etc.) and test that everything works. Try running:
ollama pull codellama:7b-instruct   # download Code Llama 7B (instruct variant)
ollama run codellama:7b-instruct    # run the model with an example prompt
The first command will fetch the model weights for the 7B parameter Code Llama model (which might be around 3–4 GB download) and the second will start an interactive session with that model. You could, for instance, type a request like “Write a Python function to compute the Fibonacci sequence” and observe the model generating code (more on Code Llama in Section 4.2). This confirms that Ollama is properly installed and that the CLI communicates with the background server to generate answers.
Uninstallation: If you ever need to remove Ollama from macOS, you will have to delete the application and associated files manually (since macOS apps often do not come with an uninstaller). This involves deleting the Ollama.app from the Applications folder and removing the support files. According to the documentation, you should remove the symlink (/usr/local/bin/ollama if it exists) and the data directories such as ~/Library/Application Support/Ollama, caches (under ~/Library/Caches), saved state, and the ~/.ollama folder which contains downloaded models[26]. Be cautious when deleting these to avoid removing other needed files. After this, Ollama will be fully removed from your system.
Linux (Note): While this chapter focuses on Windows and macOS, it is worth noting that Ollama also supports Linux. Installation on Linux is typically done via a single shell script command (for example, using cURL to run the install script)[27]. This makes it easy to set up on servers or research clusters running Linux. The Linux version runs as a command-line service (ollama serve) and supports similar capabilities. Researchers comfortable with a Linux environment can thus also leverage Ollama in that context, for instance on a GPU-equipped Linux server.
With Ollama installed on your platform of choice, we can now explore what Ollama offers and how to make use of it.
18.5 Understanding Ollama: Running Local LLMs Made Easy
Once installed, Ollama provides a consistent environment for working with local LLMs. Its design centers around a client-server model: the background service (or “Ollama server”) loads and serves the models, while the user interacts through the CLI or API. This architecture means you can start the service once and then issue multiple queries to models without reloading them each time, which improves efficiency for iterative work.
Ollama Service and CLI: On both Windows and macOS, after installation, Ollama runs a local server process in the background[5]. The server is responsible for model management and inference. The command-line interface ollama that you use in Terminal or Command Prompt is essentially a client that sends requests to this server. By default, the server listens at http://localhost:11434 (11434 is the default port) for API calls[5]. When you run a command like ollama run <model>, the CLI forwards your request to the server, which loads the model into memory (if it is not already loaded) and streams the generated tokens back to your terminal. Because the server keeps recently used models loaded, follow-up requests are typically faster than the first one.
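Because the CLI and the API talk to the same local server, a quick programmatic health check is possible from any language. The short Python sketch below is an illustrative example only, assuming the server is running on the default port and that the requests package is installed; it lists the locally installed models via the server's model-listing endpoint (GET /api/tags), and the API itself is described in detail later in the chapter.

import requests

# Ask the local Ollama server which models are installed.
# A connection error here means the background service is not running.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])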
Hardware Acceleration: Ollama supports running models on both CPUs and GPUs. If a compatible GPU is available (as per the system requirements), Ollama will attempt to use it for model inference to significantly speed up generation. For instance, on Windows it supports NVIDIA CUDA and AMD ROCm for GPU acceleration[7]. On macOS, Apple’s Metal Performance Shaders are used under the hood for Apple Silicon GPUs, enabling efficient execution on M1/M2 chips. The use of quantized model formats (such as GGUF/GGML quantized weights) also allows large models to run with less memory than full-precision, making local execution more feasible. However, it is important to set expectations according to your hardware: a very large model (tens of billions of parameters) might run slowly or not load at all on a laptop with limited RAM or without a strong GPU. Ollama’s documentation notes that lightweight models (e.g. 7B parameters) can run on modern CPUs with ≥8 GB RAM, whereas larger models like 70B parameters generally require a GPU with at least 16 GB VRAM to run comfortably[28]. If you attempt to run a model beyond your hardware’s capability, you may experience slow responses or out-of-memory errors.
Storage and Model Management: By default, Ollama stores models and related data in a dedicated directory (~/.ollama on Unix-like systems, or %HOMEPATH%\.ollama on Windows)[29]. Each model is typically downloaded as a set of weight files (often in a compressed format specialized for efficient inference). The first time you run or explicitly pull a model, Ollama will download these files from the internet (Ollama's model repository). This one-time download can be time-consuming for large models, but subsequent uses are offline. It's worth emphasizing that model files can be very large – for example, a 30B-parameter model might be tens of gigabytes in size[11]. Plan your storage accordingly. If needed, you can configure a custom model directory as described in the installation section, using the OLLAMA_MODELS environment variable to point to a path on a drive with sufficient space[13].
Ollama's CLI provides a few key commands for model management: ollama pull <model> downloads a model (or updates it to the latest version), ollama run <model> starts an interactive session with a model (pulling it first if necessary), ollama list shows the models currently installed on your machine, ollama rm <model> deletes a model to reclaim disk space, and ollama ps shows which models are currently loaded in memory.
Capabilities and Extensions: Ollama is not just a basic wrapper; it includes advanced features to enhance the model usage experience. For example, it supports streaming output, meaning it can stream the model's reply token-by-token to your terminal or application (much like how ChatGPT streams answers) for responsiveness[31]. It also supports special prompting and interaction modes: the documentation mentions “thinking” (surfacing the intermediate reasoning of reasoning-oriented models), “structured outputs” (to enforce JSON or other specific formats in responses), “vision” (for image inputs with vision-capable models), “embeddings” (generating text embeddings for retrieval tasks), “tool calling” (models can call external tools or functions), and “web search” integration[32][33]. These capabilities enable complex use cases, such as using a local model to answer questions by searching the web, or having it reliably emit JSON that follows a supplied schema. The details of these are beyond the scope of this chapter, but it is notable that Ollama aims to support many of the advanced features one might expect from cloud-based AI APIs.
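To give one concrete taste of these capabilities, the sketch below requests structured (JSON-only) output through the API's format field. It is a minimal illustration, assuming the gemma3 model has been pulled, the server is running on the default port, and the requests package is installed; the API endpoints themselves are covered later in the chapter.

import requests

# Ask a local model to reply strictly in JSON by setting "format": "json".
payload = {
    "model": "gemma3",
    "messages": [
        {"role": "user",
         "content": "List three European capital cities as a JSON object mapping country to capital."}
    ],
    "format": "json",   # constrain the reply to valid JSON
    "stream": False,
}
r = requests.post("http://localhost:11434/api/chat", json=payload)
print(r.json()["message"]["content"])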
For most general researchers, the main takeaway is that Ollama simplifies local AI model usage. You do not need to worry about converting models to specific formats or writing custom code to run them – simply install Ollama, use ollama pull to get a model, then query it with ollama run or through the API. In the next section, we will discuss some of the well-performing models available via Ollama and how they align with different tasks. This will help you decide which models to use for coding tasks, text summarization, or implementing a retrieval-augmented pipeline.
18.6 Models Available in Ollama and Their Use Cases
One of Ollama’s strengths is its curated library of open-source models. Through Ollama, you can access a variety of cutting-edge LLMs without manually hunting down model files or worrying about compatibility. The models span various sizes (from a few hundred million parameters up to hundreds of billions) and are fine-tuned for different purposes such as general conversation, coding, knowledge retrieval, and multimodal understanding. Below we describe a few notable models that are available, focusing on those particularly useful for coding and for RAG (retrieval-augmented generation) scenarios. Each description includes the model’s intended purpose and reasons it might be chosen for a given task, with references to documentation or model cards.
• GPT-OSS: GPT Open-Source Series. GPT-OSS is a family of open-weight models that aim to approximate the capabilities of proprietary GPT-series models using fully open-source weights[34]. These models are designed for powerful reasoning, tool use, and versatile language tasks[34]. They come in large parameter counts (e.g., 20B or 120B) and have broad knowledge. GPT-OSS is a good choice when you need strong general-purpose performance and are prepared to run a larger model on a capable machine. Researchers have used GPT-OSS models for complex question answering and agentic tasks where the model might need to follow multi-step instructions or call external tools (with proper prompting). Essentially, GPT-OSS is an open substitute for a “GPT-3.5/GPT-4”-like capability in a local environment.
• Gemma 3: Gemma 3 is an open model (originating from Google research) known for solid instruction following and general knowledge. It comes in sizes of 1B, 4B, 12B, and 27B parameters[35], allowing users to trade off capability against speed and memory. Gemma 3 has been highlighted, alongside DeepSeek, as offering performance close to leading models such as OpenAI's o3 or Google's Gemini 2.5[36]. In practice, Gemma 3 is suitable for conversational agents and text generation where a robust, instruction-tuned model is needed. It may not be as well-known as Llama or GPT, but it represents a high-quality open model for everyday tasks. Additionally, specialized versions exist, such as FunctionGemma, a fine-tuned variant of Gemma 3 (270M parameters) built specifically for function (tool) calling[37] – useful if you want the model to reliably call APIs or functions in an agent setting.
• DeepSeek‑R1: This is a family of open reasoning models that has gained attention for its strong performance on complex tasks. DeepSeek‑R1 models are designed for advanced reasoning and problem-solving, reportedly approaching the performance of leading-edge models in the same class[36]. For example, a DeepSeek‑R1 7B model can often compete with larger models on logical reasoning tasks thanks to its fine-tuning. Ollama lists DeepSeek‑R1 in sizes from 7B up to an extremely large 671B (the latter being a research-grade model requiring massive hardware)[38]. In typical usage, one might start with the 7B or smaller variants for experimentation. These models can be useful in RAG pipelines as the “brain” that generates the final answer after retrieval, especially when the question requires multi-hop reasoning or complex deduction. Thanks to their optimization for reasoning, they tend to handle chain-of-thought prompts well. A Reddit user described successfully running DeepSeek‑R1 locally with Ollama and noted that bigger models yield smarter responses, albeit at the cost of speed[39].
• Llama 2 and Llama 3 series: Meta's Llama family is well represented in Ollama's library, including not only Llama 2 (which Meta released openly) but also the newer Llama 3.x iterations. For example, Llama 3.2 and Llama 3.3 models appear in the library, as do early Llama 4 entries[38]. These are cutting-edge open models with strong general performance. Llama 2 (especially the 7B and 13B sizes) became a popular baseline for many applications, and fine-tuned variants (like Llama 2 Chat) are available for dialogue tasks. The Llama 3.x releases further improve on Llama 2's capability. In an Ollama context, you might choose a Llama model for text generation, summarization, or as a base model in a custom pipeline because they balance good performance with relative efficiency (the 7B/8B models can run on a CPU with sufficient RAM, for instance). One integration guide notes that the 8B-parameter variant of Meta's Llama 3 was used as the chat model in a RAG pipeline example[40]. Meta's models are known for robust multilingual abilities and a large knowledge base, which makes them versatile.
• Code Llama: For coding tasks, Code Llama is a top choice. Code Llama is built on Llama 2 and fine-tuned for programming, making it adept at generating code, explaining code, and completing code snippets[41]. It supports many popular languages (Python, C++, Java, JavaScript/TypeScript, C#, Bash, etc.) and can output not only code but also natural language explanations about code[41]. Within Ollama, Code Llama is offered in multiple sizes: 7B, 13B, 34B, and 70B parameters[42]. Each size also has specialized variations: “CodeLlama-instruct”, tuned to follow natural-language instructions; “CodeLlama-Python”, further refined on Python code; and “CodeLlama-code”, the base model optimized for code completion tasks[43][44]. A researcher working on a coding assistant or doing code generation should use the instruct or Python variant for best results when asking the model to write new code or explain algorithms, while the code-completion variant suits IDE integration and autocompletion. Code Llama also stands out for its ability to handle structured code prompts like “fill-in-the-middle” (where you provide a prefix and suffix of code and ask the model to generate the missing part) using special tokens in the prompt[45]. In summary, Code Llama is a performant model for software development tasks, making it an ideal example in our coding use case later. (We will demonstrate Code Llama usage in Section 5.1.)
• Qwen3 Coder: Another excellent model for coding is Qwen3 Coder, developed by Alibaba. Qwen (short for Tongyi Qianwen) models are known for strong performance, and the coder variant specifically targets programming capabilities. Qwen3 Coder is available in large sizes (30B and up, often quantized to run on smaller GPUs) and supports a long context window and agentic behavior[46]. According to the model library, Qwen3-Coder excels in both coding tasks and “agentic” tasks – meaning it can follow tool-using instructions or multi-step reasoning in code-centric scenarios[47]. Some users prefer Qwen3 14B as a coding model when they have around 12 GB of GPU VRAM, noting it “does decently well and can handle a big context window”[48]. A benefit of Qwen is its training on both conversational and code data, which gives it versatility. In practice, if you are generating complex code or using the model in an autonomous coding agent, Qwen3 Coder is a good candidate. It is also distributed in different quantizations (4-bit, etc.) so it can run on consumer GPUs despite the large parameter count.
• Embedding Models (for RAG): For retrieval tasks, specialized embedding models are used to convert text into high-dimensional vectors. Ollama includes several ready-to-use embedding models that are optimized for semantic similarity. Three recommended ones are listed in the documentation[49]:
• EmbeddingGemma: a 300M-parameter embedding model from Google[50] that produces (typically) 768-dimensional embeddings. It is trained to generate vector representations suitable for semantic search. Because it is small and fast, it is well suited to indexing large document sets and querying them.
• Qwen3-Embedding: an embedding model derived from the Qwen3 series, available in sizes of 0.6B, 4B, and 8B parameters[51]. The smaller variants are efficient, while the larger ones may yield slightly more nuanced embeddings. These are a natural choice if you plan to use a Qwen generator model, keeping the same “family” for embeddings (though matching families is not strictly necessary).
• All-MiniLM: (implied by the reference to “all-minilm”) refers to MiniLM/SBERT-style models known for producing good embeddings quickly. Ollama provides all-minilm, a lightweight Transformer model (originating from Microsoft research and popularized through Sentence-Transformers) tuned for all-around sentence embeddings. It is a good choice when resources are very limited, as MiniLM models are on the order of 100M parameters or smaller and still give reasonable sentence-similarity performance.
These embedding models output a numeric vector for each input text. When building a RAG pipeline, you would use one of them to embed your documents and queries. Using the same embedding model for both indexing and querying is important to ensure the vectors are comparable[52]. We will see how to generate embeddings with Ollama in the examples section (Section 5.3). In summary, embedding models in Ollama enable semantic search and retrieval — they are a key piece for applications where you need to find relevant pieces of text to feed into a larger LLM before getting an answer.
• Other Notable Models: Beyond those highlights, Ollama's library contains many other models. A few worth mentioning include:
• Mistral: a newer series of models (e.g., Mistral 7B) known for strong performance relative to size, designed for efficient on-device operation[53]. Mistral 7B can be a great general-purpose model if you need a smaller footprint with good output quality – it is often cited as being on par with larger 13B models on some tasks, thanks to excellent training.
• Stable Code: (referred to as “Stable Code” or stable-code in some sources[54]) – a 3B model fine-tuned for code completion, reportedly achieving capability similar to Code Llama 7B despite being much smaller[54]. This can be useful on hardware that cannot handle a 7B model.
• Vision-Enabled Models: such as LLaVA (Large Language and Vision Assistant)[55] or Qwen-VL[56], which can analyze images along with text. If your research involves multimodal inputs (e.g., describing an image or reading text from an image), these models are relevant. They require additional steps to provide image data (often giving a path to an image in the prompt), and Ollama supports this, as shown by the example ollama run llava "What's in this image? path/to/image.png"[57].
• Tool-using Agentic Models: Some models are fine-tuned to more readily use tools or follow chain-of-thought reasoning. For instance, the Granite-4 and Nemotron series claim improved tool-use and agent capabilities[58]. These may be experimental but can be interesting if you want a model that more readily calls functions or external APIs when integrated with frameworks.
In choosing a model, consider your task and your hardware. For coding, the specialized coding models (Code Llama, Qwen Coder, etc.) will give the best results. For summarization or general Q&A, a high-quality instruct model like Gemma 3 or Llama 2 would be suitable. For RAG, pair an embedding model with a strong generator (for example, use EmbeddingGemma or Qwen3-Embedding to retrieve documents, and a larger model like Llama or GPT-OSS to generate the final answer from those documents). The good news is that Ollama makes it easy to try multiple models: the interface to run them is the same, and many can coexist on your system as long as you have storage space. In the next section, we will walk through practical examples demonstrating how to use some of these models for code generation, text summarization, and retrieval-augmented Q&A.
18.7 Practical Examples: Using Ollama for Code, Summarization, and RAG
In this section, we illustrate how to use Ollama-hosted models for three common tasks: code generation, text summarization, and retrieval-augmented question answering. We will show example prompts and describe how the models respond, using either the command-line interface or simple scripts. These examples assume you have installed Ollama and downloaded the mentioned models as per the earlier sections.
Example: Code Generation with Code Llama
One of the exciting applications of LLMs is assisting with programming tasks. Using a local model like Code Llama, you can get code suggestions or even have the model write entire functions for you. Let’s walk through an example scenario.
Task: Suppose we want to generate a function that computes the Fibonacci sequence in Python. We’ll use the CodeLlama-Instruct model (7B size) for this, as it’s tuned to follow natural language instructions and produce helpful answers including code. Ensure you have the model pulled: ollama pull codellama:7b-instruct. Then run an interactive session by entering:
ollama run codellama:7b-instruct
Now you will get a prompt (typically >>>) where you can type your query. We provide the instruction:
User: "You are an expert programmer that writes simple, concise code and explanations. Write a Python function to generate the nth Fibonacci number."
When you press Enter, Code Llama will process this prompt and begin streaming its response. Because we used the instruct variant, it will typically give a brief explanation and then present code. For example, the output might look like:
Certainly! Here's a simple Python function to compute the n-th Fibonacci number:
def fibonacci(n):
    """Compute the nth Fibonacci number."""
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
This recursive function calls itself to sum the two preceding Fibonacci numbers until it reaches the base cases (0 or 1). (The above is a hypothetical formatted output — the actual wording may vary, but Code Llama is known to produce correct Fibonacci code in answers.)
As we see, the model not only gave us the code but also a brief explanation, because we asked it to act as an expert who writes explanations. Code Llama's training on conversational programming data enables this kind of response. If we wanted just the code without any explanation, we could prompt it more directly, or use the codellama:7b-code variant, which is meant for pure code completion. For instance, using the code-completion model, one could provide a prompt like:
ollama run codellama:7b-code "# A Python function to compute Fibonacci numbers:"
And the model would likely output the function code directly (possibly using the special infill tokens if needed for inserting into existing code).
Other coding uses: Code Llama can handle tasks such as debugging and code review as well. For example, you can paste a piece of code and ask, “Where is the bug in this code?”, and the model will analyze it and point out potential issues[59]. You can also ask it to write unit tests for a given function, or to provide documentation strings. Another powerful feature is fill-in-the-middle: if you have partially written code with a gap in the middle, you can wrap the prefix and suffix with <PRE> and <SUF> tokens and end the prompt with <MID>; Code Llama's completion model will then generate the missing middle section[45]. This is useful for IDE integrations where the model autocompletes code in context, and a small scripted example follows below.
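As an illustration of the fill-in-the-middle pattern, the sketch below sends an infill prompt to the code-completion variant through the local HTTP API (described later in the chapter). Treat it as a rough sketch: the token spelling follows the Code Llama model card, the function being completed is an arbitrary example, and codellama:7b-code is assumed to have been pulled.

import requests

# Supply a prefix and a suffix; the model generates the missing middle after <MID>.
prefix = 'def remove_non_ascii(s: str) -> str:\n    """'
suffix = '\n    return result\n'
payload = {
    "model": "codellama:7b-code",
    "prompt": f"<PRE> {prefix} <SUF>{suffix} <MID>",
    "stream": False,
}
r = requests.post("http://localhost:11434/api/generate", json=payload)
print(r.json()["response"])   # the generated middle section (docstring and body)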
Using local code models is invaluable for researchers who want an AI coding assistant but need to maintain privacy (e.g., code for a sensitive project that cannot be sent to cloud services). While Code Llama’s performance is not quite at the level of OpenAI’s Codex or GPT-4 for code, it is quite competent for many tasks and continues to improve with community fine-tuning.
18.8 Example: Text Summarization with a Conversational Model
Summarizing text is a common task in digital research, such as condensing interview transcripts, literature, or reports. Here, we will demonstrate using an Ollama model to summarize a piece of text. Let’s assume we have a text file report.txt containing a few paragraphs of a report. Our goal is to produce a concise summary capturing the main points.
Choosing a model: We could use any good general-purpose instruct model for summarization. For this example, let’s use Gemma 3 (4B), which we have introduced earlier as a solid all-round model. Ensure it’s downloaded: ollama pull gemma3 (by default that gets the 4B version). Then run an instruction prompt. We can do this in one command by using shell piping to feed the content of the file into the model. For instance, if on a Unix-like terminal (macOS) we can do:
ollama run gemma3 "Summarize the following report: $(cat report.txt)"
This command uses command substitution to insert the entire content of report.txt into the prompt after our instruction. (On Windows PowerShell, a similar approach can be taken using Get-Content to read the file into a variable and then passing it to ollama run.) Ollama accepts multi-line prompts and large context lengths, though keep the model's context limit in mind – smaller models may accept only a few thousand tokens of input, roughly a few pages of text. The model will then output a summary. For example, if report.txt were a few paragraphs about economic trends, the output might be a paragraph like: “This report discusses recent economic trends, noting that inflation has stabilized at around 3%, unemployment is slightly down, and consumer spending has increased in Q4. Key drivers include improved supply chains and strong job growth in the tech sector. The report concludes that the economy is cautiously optimistic heading into the next year, though potential interest rate hikes by the central bank could temper growth.”
Again, the exact wording will vary, but the idea is that the model condenses the main points. If the initial summary is too verbose or not focused enough, you can prompt the model to be more concise, e.g., “Summarize the following text in 3-4 sentences.” Instruct-tuned models like Gemma 3 are generally good at following such directives to adjust length or style.
Ollama also supports streaming output, so if you were doing this interactively (without the $(cat file) trick), you could simply paste the text into the interactive session. Since a large paste can be cumbersome, the one-liner approach is convenient for automation. Indeed, the documentation demonstrates using ollama run llama3.2 "Summarize this file: $(cat README.md)" as a way to quickly summarize a file from the command line[60]. This could be integrated into scripts for batch summarization of documents.
Summarization quality will depend on the model's size and training. A 4B model can handle short texts reasonably well but may miss nuances in longer documents. For more accuracy on long documents, consider a larger model (such as a 13B or 30B model), or chunk the document and summarize it in parts, as sketched below. There are also models fine-tuned for summarization or for long contexts that can handle very long texts when available through Ollama. In any case, the workflow is: choose a model -> feed it the text -> get a summary, all without leaving your machine or sending data externally.
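For longer documents, the chunk-and-summarize workflow mentioned above can be scripted with the ollama Python package (pip install ollama, covered in the API section later). The sketch below is illustrative only: it assumes gemma3 has been pulled and report.txt exists, and it uses a crude character-based chunk size as a stand-in for proper token counting.

from ollama import chat

def summarize(text, model="gemma3"):
    # Ask the local model for a short summary of the given text.
    resp = chat(model=model, messages=[
        {"role": "user",
         "content": f"Summarize the following text in 3-4 sentences:\n\n{text}"},
    ])
    return resp.message.content

with open("report.txt", encoding="utf-8") as f:
    report = f.read()

# Split into chunks that fit comfortably in the context window, summarize each
# chunk, then summarize the combined chunk summaries.
chunk_size = 6000  # characters; a rough proxy for the model's token limit
chunks = [report[i:i + chunk_size] for i in range(0, len(report), chunk_size)]
partial_summaries = [summarize(c) for c in chunks]
print(summarize("\n\n".join(partial_summaries)))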
Example: Retrieval-Augmented Q&A (RAG)
For our final example, we demonstrate a basic Retrieval-Augmented Generation scenario using Ollama. Imagine you have a set of documents (e.g., a collection of articles or a knowledge base for a project), and you want to ask questions that require information from those documents. A RAG pipeline typically involves: 1. embedding the documents into a vector database, 2. retrieving the most relevant documents given a query, and 3. generating an answer using an LLM, conditioned on the retrieved documents.
Let’s do a miniature version of this pipeline with Ollama. For simplicity, we’ll simulate a small “database” of just a few text snippets and show how to use Ollama’s embedding model and chat model.
Setup: Suppose we have three short documents:
- Doc1: “The Eiffel Tower is located in Paris and was completed in 1889.”
- Doc2: “The Great Wall of China stretches over 13,000 miles and was built over many centuries.”
- Doc3: “The Pyramids of Giza in Egypt were constructed roughly between 2580 and 2510 BC.”
We load these into memory (in a real scenario, you’d have many more documents stored in a vector database like Chroma or Azure Cosmos DB, etc., but here we’ll just work with these manually). We’ll use the EmbeddingGemma model to encode these documents into vectors. Ensure the model is downloaded: ollama pull embeddinggemma. Now, using Python (with Ollama’s Python library installed via pip install ollama), we can embed and retrieve. But for clarity, let’s demonstrate using the command-line and a bit of conceptual explanation:
• Embedding the Documents: We call Ollama's embedding endpoint for each document (the API is described in detail later in the chapter; embedding-only models are queried through the API rather than with ollama run). For example:
curl -X POST http://localhost:11434/api/embed -d '{"model": "embeddinggemma", "input": "The Eiffel Tower is located in Paris and was completed in 1889."}'
The response is a JSON object whose "embeddings" field contains the embedding vector for that sentence[61]. Ollama's embedding endpoint returns L2-normalized (unit-length) vectors, which is convenient for cosine-similarity calculations[62]. We do the same for the other two documents. Suppose we obtain three vectors v1, v2, v3 corresponding to Doc1, Doc2, Doc3 respectively.
• Querying: Now a user asks a question: “When was the Eiffel Tower built, and where is it located?”. We need to answer it using our documents, so we first embed the query in the same vector space:
curl -X POST http://localhost:11434/api/embed -d '{"model": "embeddinggemma", "input": "When was the Eiffel Tower built and where is it located?"}'
This produces a vector q for the query. We then compute similarity (cosine similarity is recommended for semantic search[52]) between q and each document vector v1, v2, v3. In practice, one would use a vector database or a small script to do this calculation. Intuitively, q should be closest to v1 (which contains Eiffel Tower info) because the query is about the Eiffel Tower. So we identify Doc1 as the relevant piece.
• Retrieval: We fetch Doc1 as the top result. (If multiple documents were relevant, we might take the top 2-3 and concatenate them or otherwise provide them to the model.)
• Generation with Context: Now we formulate a prompt for a larger language model that includes the content of Doc1 as reference. For instance, we might prompt our chat model (say, Llama 3 8B in this example; Gemma 3 or GPT-OSS would also give strong performance) with something like:
Context: The Eiffel Tower is located in Paris and was completed in 1889.
Question: When was the Eiffel Tower built, and where is it located?
Answer:
Essentially, we prepend the retrieved text as “Context” and then ask the question. It’s good to clearly delineate the context and question to the model. Using the Ollama Python API, this could be done as:
from ollama import chat

# retrieved_doc and query hold the retrieved context and the user's question from the steps above
messages = [
    {"role": "system", "content": "You are a helpful assistant that uses the provided context to answer the question."},
    {"role": "user", "content": f"Context: {retrieved_doc}\n\nQuestion: {query}\n\nAnswer:"},
]
response = chat(model="llama3:8b", messages=messages)
print(response.message.content)
The system message is optional but can guide the model to use context. The user message contains both context and the actual question.
When we run this, the model will generate an answer using the context. Given our context, a correct answer would be: “The Eiffel Tower is located in Paris, France, and it was completed (built) in 1889.” The model might phrase it in a full sentence: “According to the provided information, the Eiffel Tower was completed in 1889 and it is located in Paris.”
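The whole loop can also be scripted in one place with the ollama Python package (introduced in the API section later). The following is a minimal sketch under the assumption that the embeddinggemma and llama3:8b models have been pulled; because the embedding vectors are unit length, a plain dot product serves as the cosine similarity.

import ollama

docs = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "The Great Wall of China stretches over 13,000 miles and was built over many centuries.",
    "The Pyramids of Giza in Egypt were constructed roughly between 2580 and 2510 BC.",
]
query = "When was the Eiffel Tower built, and where is it located?"

# 1. Embed the documents and the query with the same embedding model
doc_vecs = ollama.embed(model="embeddinggemma", input=docs).embeddings
q_vec = ollama.embed(model="embeddinggemma", input=[query]).embeddings[0]

# 2. Retrieve the most similar document (dot product = cosine for unit vectors)
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

best = max(range(len(docs)), key=lambda i: dot(doc_vecs[i], q_vec))
context = docs[best]

# 3. Generate an answer grounded in the retrieved context
resp = ollama.chat(model="llama3:8b", messages=[
    {"role": "system", "content": "Answer using only the provided context."},
    {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"},
])
print(resp.message.content)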
This demonstrates a simple RAG loop using Ollama: embed -> retrieve -> answer. Importantly, all steps are done locally. Embeddings were generated via Ollama’s embedding model, and the final answer came from a local LLM. If we were to scale this up, we’d use a proper vector store for efficiency. Indeed, a Microsoft Azure blog post provides a full tutorial on building a local RAG application with Ollama and Azure Cosmos DB (a NoSQL database with vector search)[63][64]. In that example, they used a high-quality 1024-dimension embedding model (mxbai-embed-large) and Meta’s Llama 3 (8B) for the Q&A, demonstrating that even an 8B model can be effective in RAG when given the right context[65]. The pipeline loads documents into the database, uses LangChain to retrieve relevant chunks via the embedding, and then calls Ollama for the final generation[66][67].
For integrating with such frameworks, read on to the next section. The key point from this example is that Ollama enables each component of RAG: you have the embedding models to vectorize text and the generative models to produce answers, all accessible through the same unified interface.
Integrating Ollama into a RAG Pipeline (LangChain, LlamaIndex)
Now that we have seen an example of a RAG process, you may wonder how to implement this cleanly in code. Fortunately, popular libraries like LangChain and LlamaIndex (formerly GPT Index) have started supporting local model backends like Ollama, making it straightforward to swap in a local LLM in place of an OpenAI API call. This section discusses how one can integrate Ollama models into these frameworks, allowing complex pipelines with minimal glue code.
LangChain Integration: LangChain is a framework that provides abstractions for LLMs, prompts, memory, tools, and chains of calls. Ollama integration with LangChain is available and documented by the LangChain team[68]. There are essentially two integration components:
- Chat Model: ChatOllama – allows using an Ollama model as a conversational model (following the chat interface with system/user/assistant roles).
- Embedding Model: OllamaEmbeddings – allows using an Ollama embedding model as a vectorizer within LangChain's retrieval or vector store utilities.
To use these, you typically need to install an integration package (for example, pip install langchain-ollama might be required, as indicated by the existence of a PyPI package[69]). Once set up, you can do something like:
from langchain.llms import Ollama

llm = Ollama(model="llama3:8b")   # instantiate a LangChain LLM pointing to the local Ollama model
response = llm("Hello, how are you?")
This would send the prompt to Ollama and return the response as a string. Under the hood, this uses Ollama's API endpoint. You can also pass generation parameters (such as temperature) when constructing the wrapper if needed. For embeddings:
from langchain.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="embeddinggemma")
vec = embeddings.embed_query("sample text to vectorize")
This will return a vector (list of floats) by calling Ollama’s embed API behind the scenes[70]. You can then use these vectors with LangChain’s vector store classes (like FAISS, Chroma, etc.) to build an index.
LangChain’s documentation confirms that Ollama allows you to run open-source models locally and that it’s supported as a provider[68]. The integration means you can build a chain where, for example, a RetrievalQA chain uses OllamaEmbeddings for the retriever to embed the query and docs, and ChatOllama as the LLM to generate the final answer. This yields a fully self-contained pipeline on your machine. In fact, the Microsoft sample mentioned before uses LangChain with Ollama and a local database[67]. They show that after setting up Ollama and the DB, only minimal LangChain code is needed to orchestrate the RAG: LangChain handles splitting documents, creating embeddings via Ollama, storing them in the vector DB, and then on a query, retrieving and feeding context to the Ollama chat model[66][71].
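As a minimal sketch of that pattern (reusing the llm and embeddings objects from the snippets above), the following builds a small in-memory FAISS index and answers a question from the retrieved context. Import paths and class names vary across LangChain versions, and the faiss-cpu package must be installed, so treat this as an outline rather than a definitive recipe.

from langchain_community.vectorstores import FAISS  # requires: pip install faiss-cpu

docs = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "The Great Wall of China stretches over 13,000 miles.",
]

# Index the documents with the Ollama embedding model wrapped by LangChain
db = FAISS.from_texts(docs, embedding=embeddings)

# Retrieve the most relevant chunk and feed it to the local LLM as context
query = "When was the Eiffel Tower completed?"
hits = db.similarity_search(query, k=1)
context = "\n".join(d.page_content for d in hits)
answer = llm(f"Context: {context}\n\nQuestion: {query}\n\nAnswer:")
print(answer)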
LlamaIndex Integration: LlamaIndex is another library that simplifies connecting LLMs with your data (documents, SQL, etc.), often by building indices that can be queried. Recent versions of LlamaIndex ship dedicated Ollama connectors for both chat models and embeddings, and even without them it can work with any LLM you can call programmatically: you could wrap the Ollama Python client or HTTP API in a custom LLM class. The concept is the same as with LangChain: during index construction, use an Ollama embedding model to embed text; during a query, use an Ollama chat model to generate the answer. The LangChain example therefore translates directly to LlamaIndex once Ollama is accessible in code. The Azure blog post notes that the sample app was easily adaptable to LlamaIndex, meaning that swapping LangChain's mechanism for LlamaIndex's is straightforward[64].
In summary, integrating Ollama into these pipelines typically involves pointing the framework to Ollama’s local API. Both LangChain and LlamaIndex are designed to be modular with respect to the LLM backend. By using the official or community-supported connectors (as with LangChain’s ChatOllama and OllamaEmbeddings), you can develop complex applications – like a chatbot that can cite from PDFs or a code assistant that refers to documentation – all running on local models provided by Ollama. This aligns with the interest of many researchers in maintaining control and privacy: your data and queries never leave your machine, yet you harness powerful AI capabilities.
18.9 Ollama API and Programmatic Usage
Ollama not only provides a CLI but also exposes a local HTTP API that developers can use to interact with the models from any programming language. This design makes it easy to integrate Ollama into custom applications, beyond the high-level frameworks discussed earlier. In this section, we describe how the API works and give examples of using it.
Local REST API: By default, once the Ollama service is running, it listens on http://localhost:11434 for API requests[5]. There are multiple endpoints, corresponding to different operations:
- POST /api/generate – for one-off text generation given a prompt (suitable for single-prompt completions).
- POST /api/chat – for conversational interactions (allows multi-turn conversations by sending a list of messages with roles).
- POST /api/embed – for generating embeddings from input text.
- (There are additional endpoints for listing models and other management functions, but the above are the primary ones for usage.)
These endpoints expect JSON payloads and return responses typically in JSON format. Let's look at each briefly:
• /api/generate: This is a simple endpoint where you provide a model and a prompt, and you get back a completion (the model’s generated text). For example, using curl you could do:
curl -X POST http://localhost:11434/api/generate -d '{
"model": "codellama",
"prompt": "Write me a function that outputs the Fibonacci sequence",
"stream": false
}'
This JSON payload specifies the model (codellama defaults to its base variant or instruct variant depending on config), the prompt string, and "stream": false indicating we want the full output in one response rather than streamed chunks. The response will be a JSON object that includes the model’s output text. In the Code Llama model card example, they show a similar usage of /api/generate to get a code snippet back[72]. If "stream": true is set, the API would send events or chunks that you’d have to read in sequence (usually via an HTTP chunked response). Streaming is useful for interactive applications where you want to show the answer as it is being generated.
• /api/chat: This endpoint is used for multi-turn conversations. The payload format includes a list of messages, each with a “role” (such as “system”, “user”, “assistant”) and “content” (the text of the message). For instance:
{
  "model": "gemma3",
  "messages": [
    {"role": "system", "content": "You are a helpful tutor."},
    {"role": "user", "content": "Hello there!"}
  ],
  "stream": false
}
Posting this to /api/chat will get the model’s reply as if it were continuing a chat, considering the roles. The first message sets the context or persona, and the second is the user’s input. The API will return an assistant message as the completion. The Quickstart in the documentation demonstrates this usage: they send a user message “Hello there!” and get a response[31]. This interface aligns with the ChatGPT-style of interaction and supports maintaining a conversation by appending to the messages list. Note that Ollama does not inherently store the conversation between calls; the client needs to send the full message history each time (unless using some stateful feature, but stateless is the typical approach). So to have a back-and-forth, you would keep accumulating the conversation in the messages array and call /api/chat with it each time the user adds a new query.
• /api/embed: This endpoint converts input text to embedding vectors. The JSON payload expects an “input” field which can be either a single string or an array of strings, and a “model” field specifying which embedding model to use. For example:
{
  "model": "embeddinggemma",
  "input": ["The quick brown fox jumps over the lazy dog."]
}
Posting this to /api/embed will return a JSON with an "embeddings" field, which is an array of vectors (each vector itself being an array of numbers)[73]. In this case, since one sentence was given, the result might look like {"embeddings": [[0.0123, 0.085, ... ]]} – one vector inside an outer array. If you send multiple inputs (batch), you’ll get a vector for each input in order. Ollama ensures the vectors are normalized (unit length)[62]. Using curl, an example call from the docs is:
curl -X POST http://localhost:11434/api/embed \
-H "Content-Type: application/json" \
-d '{"model": "embeddinggemma", "input": "The quick brown fox jumps over the lazy dog."}'
which would yield the JSON embedding[74]. Programmatically, one can use the Python or JavaScript library to call this more easily.
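Before turning to those official libraries, the sketch below exercises all three endpoints directly from Python with the generic requests package. It is an illustrative example, assuming the codellama, gemma3, and embeddinggemma models have been pulled and the server is listening on the default port.

import requests

BASE = "http://localhost:11434"

# One-off completion via /api/generate
r = requests.post(f"{BASE}/api/generate", json={
    "model": "codellama",
    "prompt": "Write a one-line Python lambda that squares a number.",
    "stream": False,
})
print(r.json()["response"])

# Multi-turn chat via /api/chat: the client resends the full history on each call
messages = [
    {"role": "system", "content": "You are a helpful tutor."},
    {"role": "user", "content": "Hello there!"},
]
r = requests.post(f"{BASE}/api/chat",
                  json={"model": "gemma3", "messages": messages, "stream": False})
reply = r.json()["message"]["content"]
messages.append({"role": "assistant", "content": reply})

# Embeddings via /api/embed
r = requests.post(f"{BASE}/api/embed", json={
    "model": "embeddinggemma",
    "input": ["The quick brown fox jumps over the lazy dog."],
})
print(len(r.json()["embeddings"][0]))  # dimensionality of the returned vector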
Ollama Client Libraries: To simplify API usage, the Ollama team provides official libraries, e.g., a Python package and a NodeJS package[75]. These wrap the REST calls in convenient functions. For instance, in Python after pip install ollama, you can do:
from ollama import generate, chat, embed

result = generate(model="codellama", prompt="Write a haiku about the ocean.")
print(result.response)
This generate call would internally call the /api/generate endpoint and return the text (or a structured object). The documentation’s quickstart provides an example using the chat() function in Python to have a model answer “Why is the sky blue?”[76]. Similarly, the JavaScript library (via npm i ollama) lets you call ollama.chat({…}) or ollama.embed({…}) for NodeJS applications[77][78]. These libraries handle constructing the HTTP requests and parsing the responses, so you as a developer can focus on the content of the interaction.
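Streaming works the same way through the library: passing stream=True returns an iterator of partial responses that can be printed as they arrive. A minimal sketch, assuming the gemma3 model is available locally:

from ollama import chat

stream = chat(
    model="gemma3",
    messages=[{"role": "user",
               "content": "Explain retrieval-augmented generation in two sentences."}],
    stream=True,   # yields chunks instead of one final response
)
for chunk in stream:
    print(chunk.message.content, end="", flush=True)
print()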
Usage Example via API: Suppose you want to integrate Ollama into a simple web application. You could set up a backend route that, when it receives a request (say from a user form), it then calls Ollama’s API and returns the result. For example, a Flask (Python) endpoint might do:
from flask import Flask, request, session
import requests

app = Flask(__name__)
app.secret_key = "change-me"   # required for session storage

@app.route('/ask', methods=['POST'])
def ask():
    user_query = request.json['query']
    # Maintain the conversation history in the session (or another store)
    conversation = session.get('messages', [])
    conversation.append({"role": "user", "content": user_query})
    # Always include a system prompt for consistent behavior
    system_prompt = {"role": "system", "content": "You are a helpful assistant."}
    payload = {"model": "gpt-oss:20b",
               "messages": [system_prompt] + conversation,
               "stream": False}
    response = requests.post("http://localhost:11434/api/chat", json=payload)
    bot_reply = response.json()['message']['content']
    conversation.append({"role": "assistant", "content": bot_reply})
    session['messages'] = conversation   # save the updated conversation
    return {"answer": bot_reply}
This code sketch shows how one might manage conversation state and call the API. We chose gpt-oss:20b as the model to have a powerful general answerer. The pattern would be similar for single-turn use (just omit the history and use /api/generate if you prefer).
Performance Considerations: When using the API, remember that each call engages the model for inference. For larger models, this can take several seconds (or more) per request, depending on hardware and prompt length. The API is stateless between calls (unless you resend the conversation with each request, as shown). If you expect to handle many requests (such as serving multiple users), be mindful of model loading: Ollama's server can handle concurrent requests up to a limit, but running many large models at once can strain system resources. It may be wise to use a moderate-sized model for interactive use to keep latency reasonable.
Closing the Loop – Code Generation via the API: Let's tie this back to one of our earlier examples, but using the API. For code generation, we can call the API directly without the CLI. For instance, on Windows, using PowerShell, one could run:
(Invoke-WebRequest -Method POST -Body '{"model":"codellama:7b-instruct", "prompt":"Write a function to check if a number is prime.", "stream": false}' -Uri http://localhost:11434/api/generate).Content | ConvertFrom-Json
This PowerShell snippet hits the generate endpoint and then converts the JSON output to an object for easy reading[79]. The result would contain the code for a prime-checking function. As shown in the docs, the output is piped to ConvertFrom-Json to parse the JSON into a PowerShell object[79]. On other platforms, a similar curl command or use of the client library yields the result.
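For readers working outside PowerShell, a roughly equivalent call can be made from Python with the requests package. This is a sketch under the same assumptions as above (the codellama:7b-instruct model is already pulled and the server is listening on the default port):

import requests

# Sketch: single-turn code generation via the /api/generate endpoint.
payload = {
    "model": "codellama:7b-instruct",
    "prompt": "Write a function to check if a number is prime.",
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/generate", json=payload)
resp.raise_for_status()
print(resp.json()["response"])  # the generated code as plain text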
In conclusion, Ollama’s API empowers integration and automation. Whether you are writing a small script, integrating into a research pipeline, or building an interactive app, you can treat Ollama’s local server much like you would treat an external AI service – except it’s all local. With endpoints for chat, generation, and embedding, it covers the main functionalities needed for intelligent applications. Coupled with the earlier sections on installation and models, you should now have a comprehensive understanding of how to set up, use, and programmatically leverage Ollama to run large language models on Windows or macOS, enabling advanced AI-driven research while retaining control over your tools and data.
18.10 Conclusion
In this chapter, we have provided a detailed overview of installing and using Ollama on Windows and macOS, and demonstrated its role in bringing powerful large language models to your local machine. Ollama serves as a bridge between cutting-edge AI models and everyday researchers – by simplifying installation, it allows those without deep technical expertise to experiment with LLMs in a secure, offline environment. We covered step-by-step installation processes, noting important considerations like system requirements, disk space, and configuration for both Windows and Mac. We then explained Ollama’s architecture and how it supports local model serving with a user-friendly CLI and API, highlighting the advantages of local deployment (privacy, control, customization).
A survey of available models in the Ollama library showed the breadth of tasks you can tackle: from general-purpose dialogue (Gemma 3, GPT-OSS, Llama series) to specialized coding assistants (Code Llama, Qwen Coder) and components for retrieval systems (embedding models). Citing documentation and examples, we illustrated that many of these models rival the performance of cloud APIs, especially as open research rapidly improves LLM capabilities. By walking through practical examples – generating code, summarizing text, and performing retrieval-augmented Q&A – we translated the abstract capabilities into concrete workflows. These examples serve as templates that researchers can adapt to their own projects, whether it’s automating parts of coding, digesting large documents, or building a custom QA bot for a dataset.
We also delved into integration with high-level frameworks like LangChain and LlamaIndex, emphasizing that Ollama can seamlessly plug into existing AI pipelines. This means you can incorporate local models into your research applications with minimal changes, benefiting from the ecosystem of tools built around LLMs. Finally, we detailed the Ollama API, with examples of how to call it and what responses to expect, reinforcing the idea that anything you can do with a cloud AI service, you can similarly do with Ollama via localhost – making it a versatile component in your computational toolkit.
In scholarly contexts, the ability to run large language models locally is opening new avenues for experimentation in digital humanities, social science data analysis, and beyond. It enables compliance with data governance (since sensitive data never leaves the lab), and fosters reproducibility (since models and code can be shared and run without dependence on external services). As the landscape of open LLMs evolves, tools like Ollama will likely expand to support new models and features, perhaps larger context windows, more optimized runtimes, or collaborative features. By mastering the installation and usage of Ollama now, researchers equip themselves to take advantage of these advancements on their own terms.
18.11 References
• Ollama Documentation and Model Library – providing installation guides, model cards, and usage examples[5][41][61].
• Microsoft Azure Cosmos DB Blog by A. Gupta (2025) on building a RAG application with Ollama – illustrating a practical local RAG implementation and the benefits of local LLMs[63][4].
• LangChain Documentation – confirming integration of Ollama for both chat and embeddings within standard LLM workflows[68][70].
• Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. – foundational paper introducing the RAG approach that underpins the retrieval examples discussed[6].
• Code Llama Model Card (Meta AI, 2023) – describes the capabilities of Code Llama in generating and understanding code, which aligns with our coding example[41][80].
These references and the citations throughout the chapter provide additional details and validation for the concepts covered. By following this guide and exploring the cited materials, researchers should be well prepared to harness the power of local LLMs via Ollama for a wide range of interdisciplinary applications.
[1] [2] [4] [28] [40] [63] [64] [65] [66] [67] [71] Build a RAG application with LangChain and Local LLMs powered by Ollama - Azure Cosmos DB Blog https://devblogs.microsoft.com/cosmosdb/build-a-rag-application-with-langchain-and-local-llms-powered-by-ollama/
[3] [75] Ollama’s documentation - Ollama https://docs.ollama.com/
[5] [7] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [29] [79] Windows - Ollama https://docs.ollama.com/windows
[6] [2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks https://arxiv.org/abs/2005.11401
[8] [27] [30] [35] [38] [55] [57] [60] GitHub - ollama/ollama: Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models. https://github.com/ollama/ollama
[19] [20] [21] [22] [23] [24] [25] [26] macOS - Ollama https://docs.ollama.com/macos
[31] [32] [33] [76] [77] Quickstart - Ollama https://docs.ollama.com/quickstart
[34] [37] [46] [47] [50] [51] [53] [56] [58] Ollama https://ollama.com/search
[36] deepseek-r1 - Ollama https://ollama.com/library/deepseek-r1
[39] r/ollama on Reddit: Got DeepSeek R1 running locally - Full setup … https://www.reddit.com/r/ollama/comments/1i6gmgq/got_deepseek_r1_running_locally_full_setup_guide/
[41] [42] [43] [44] [45] [59] [72] [80] codellama https://ollama.com/library/codellama
[48] 3 ways to interact with Ollama | Ollama with LangChain - YouTube https://www.youtube.com/watch?v=cEv1ucRDoa0
[49] [52] [61] [62] [73] [74] [78] Embeddings - Ollama https://docs.ollama.com/capabilities/embeddings
[54] library - Ollama https://ollama.com/library?q=code
[68] [70] Ollama - Docs by LangChain https://docs.langchain.com/oss/python/integrations/providers/ollama
[69] langchain-ollama - PyPI https://pypi.org/project/langchain-ollama/