Managing Glean Indexing and Data Connectors

v20260423
glean-core-workflow-b
This skill provides a workflow for integrating external data sources into the Glean enterprise search platform. It covers setting up custom datasources, bulk-indexing documents through the Indexing API, and managing per-document permissions so that search results only surface content a user is allowed to see. It is aimed at data engineers and developers building internal knowledge bases.

Glean Core Workflow B: Indexing & Connectors

Overview

Build custom Glean connectors: set up datasources, bulk index documents, manage content lifecycle, and configure permissions.

Instructions

Step 1: Create Custom Datasource

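// Assumed setup, not part of the original snippet: adjust both values to your deployment.
const GLEAN = process.env.GLEAN_API_BASE; // hypothetical env var, e.g. https://<instance>-be.glean.com/api
const idxHeaders = {
  Authorization: `Bearer ${process.env.GLEAN_INDEXING_TOKEN}`, // hypothetical env var
  'Content-Type': 'application/json',
};

// Register the datasource before indexing documents into it
const dsRes = await fetch(`${GLEAN}/index/v1/adddatasource`, {
  method: 'POST', headers: idxHeaders,
  body: JSON.stringify({
    name: 'internal_docs',
    displayName: 'Internal Documentation',
    datasourceCategory: 'PUBLISHED_CONTENT',
    urlRegex: 'https://docs.internal.company.com/.*',
    isOnPrem: false,
  }),
});
if (!dsRes.ok) throw new Error(`adddatasource failed: HTTP ${dsRes.status}`);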
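
Every step in this workflow sends a JSON POST to an Indexing API path, and `fetch` resolves normally even when the server answers with an error status. A small wrapper can make those failures visible. The sketch below is illustrative only: `postIndexing` is a hypothetical helper name, and the base URL, headers, and `fetchImpl` parameter (useful for testing) are supplied by the caller rather than taken from Glean's documentation.

```javascript
// Hypothetical helper: POSTs a JSON payload to `${baseUrl}/index/v1/${path}`
// and throws if the response status is not 2xx.
// `fetchImpl` defaults to the global fetch (available in Node.js 18+).
async function postIndexing(baseUrl, path, payload, headers, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/index/v1/${path}`, {
    method: 'POST',
    headers,
    body: JSON.stringify(payload),
  });
  if (!res.ok) {
    // Include the response body, which usually explains what was rejected.
    const detail = await res.text();
    throw new Error(`${path} failed with HTTP ${res.status}: ${detail}`);
  }
  return res;
}
```

With such a helper, creating the datasource becomes `await postIndexing(GLEAN, 'adddatasource', { name: 'internal_docs', ... }, idxHeaders)`, and a rejected request surfaces as an exception instead of passing silently.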

Step 2: Bulk Index Documents

// Bulk indexing replaces ALL documents in the datasource
// Use a fresh uploadId for every run; a reused ID is rejected
const uploadId = `upload-${Date.now()}`;

// Send documents in batches of 100
for (let i = 0; i < allDocs.length; i += 100) {
  const batch = allDocs.slice(i, i + 100);
  const isFirst = i === 0;
  const isLast = i + 100 >= allDocs.length;

  const res = await fetch(`${GLEAN}/index/v1/bulkindexdocuments`, {
    method: 'POST', headers: idxHeaders,
    body: JSON.stringify({
      datasource: 'internal_docs',
      uploadId,
      isFirstPage: isFirst,
      isLastPage: isLast,
      documents: batch.map(doc => ({
        id: doc.id,
        title: doc.title,
        viewURL: doc.url, // Glean's document schema names the permalink field viewURL
        body: { mimeType: 'text/html', textContent: doc.content },
        author: { email: doc.authorEmail },
        updatedAt: doc.updatedAt, // expected as epoch seconds
        permissions: { allowAnonymousAccess: true },
      })),
    }),
  });
  // fetch does not reject on HTTP errors, so stop before sending further pages
  if (!res.ok) throw new Error(`bulkindexdocuments failed on batch ${i / 100 + 1}: HTTP ${res.status}`);
  console.log(`Uploaded batch ${i / 100 + 1} (${batch.length} docs)`);
}
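
The paging logic in the loop (which batch is first, which is last) is easy to get wrong at the boundaries, so it can help to isolate it in a pure function that can be checked without calling the API. The following is a minimal sketch; `toBulkPages` is a hypothetical name, and the default page size of 100 simply mirrors the loop above.

```javascript
// Splits `docs` into pages of at most `pageSize` items and marks the first
// and last page. An empty input yields no pages, matching the loop above,
// which sends nothing when there are no documents.
function toBulkPages(docs, pageSize = 100) {
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError('pageSize must be a positive integer');
  }
  const pages = [];
  for (let i = 0; i < docs.length; i += pageSize) {
    pages.push({
      documents: docs.slice(i, i + pageSize),
      isFirstPage: i === 0,
      isLastPage: i + pageSize >= docs.length,
    });
  }
  return pages;
}

// Usage: 250 documents become pages of 100, 100, and 50.
const demoPages = toBulkPages(Array.from({ length: 250 }, (_, n) => ({ id: `doc-${n}` })));
console.log(demoPages.map(p => [p.documents.length, p.isFirstPage, p.isLastPage]));
// prints [ [ 100, true, false ], [ 100, false, false ], [ 50, false, true ] ]
```

Each page can then be spread into the request body next to `datasource` and `uploadId`. A single-page upload has both flags set to `true`.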

Step 3: Set Document Permissions

// Control who can see documents in search results:
// only the users listed in allowedUsers can find this document
const permRes = await fetch(`${GLEAN}/index/v1/indexdocuments`, {
  method: 'POST', headers: idxHeaders,
  body: JSON.stringify({
    datasource: 'internal_docs',
    documents: [{
      id: 'confidential-001',
      title: 'Board Meeting Notes',
      viewURL: 'https://docs.internal.company.com/board/q1-2025', // permalink field is viewURL in Glean's document schema
      body: { mimeType: 'text/plain', textContent: '...' },
      permissions: {
        allowedUsers: [{ email: 'ceo@company.com' }, { email: 'cfo@company.com' }],
      },
    }],
  }),
});
if (!permRes.ok) throw new Error(`indexdocuments failed: HTTP ${permRes.status}`);
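
When the list of allowed users comes from another system, it may be worth validating it before building the `permissions` object, since a malformed entry is one of the failure modes listed under Error Handling. The sketch below is an assumption-laden illustration: `restrictToUsers` is a hypothetical helper, the regex is a loose sanity check rather than a full address parser, and refusing an empty list is a design choice made here, not documented Glean behavior.

```javascript
// Builds a `permissions` object in the shape used in Step 3 from a list of
// email addresses. Entries are trimmed, validated, and de-duplicated
// case-insensitively; the first spelling of each address is kept.
function restrictToUsers(emails) {
  const seen = new Set();
  const allowedUsers = [];
  for (const raw of emails) {
    const email = String(raw).trim();
    // Loose check: something@something.something with no whitespace.
    if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email)) {
      throw new Error(`Not a valid email address: "${raw}"`);
    }
    const key = email.toLowerCase();
    if (!seen.has(key)) {
      seen.add(key);
      allowedUsers.push({ email });
    }
  }
  if (allowedUsers.length === 0) {
    // Fail loudly rather than index a restricted document nobody was meant to see.
    throw new Error('Refusing to build a permissions object with no users');
  }
  return { allowedUsers };
}
```

For example, `restrictToUsers(['ceo@company.com', 'cfo@company.com'])` returns the same `permissions` value that Step 3 writes by hand.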

Step 4: Delete Documents

// Remove a specific document from the index
const delRes = await fetch(`${GLEAN}/index/v1/deletedocument`, {
  method: 'POST', headers: idxHeaders,
  body: JSON.stringify({
    datasource: 'internal_docs',
    objectType: 'Document',
    id: 'doc-to-delete',
  }),
});
if (!delRes.ok) throw new Error(`deletedocument failed: HTTP ${delRes.status}`);
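
The request above removes one document per call. When several documents have to go, one possible approach is to loop over the IDs and collect the failures instead of aborting on the first one. This is only a sketch: `deleteDocuments` and its option names are hypothetical, and it issues the calls sequentially to stay gentle on rate limits, which is an assumption rather than a documented requirement.

```javascript
// Deletes each ID with a separate deletedocument call and returns the list
// of IDs that could not be deleted, together with their HTTP status.
// `fetchImpl` defaults to the global fetch (available in Node.js 18+).
async function deleteDocuments(ids, { baseUrl, headers, datasource, objectType = 'Document', fetchImpl = fetch }) {
  const failed = [];
  for (const id of ids) {
    const res = await fetchImpl(`${baseUrl}/index/v1/deletedocument`, {
      method: 'POST',
      headers,
      body: JSON.stringify({ datasource, objectType, id }),
    });
    if (!res.ok) failed.push({ id, status: res.status });
  }
  return failed;
}
```

A caller would typically log or retry whatever comes back in the returned array, for example `const failed = await deleteDocuments(staleIds, { baseUrl: GLEAN, headers: idxHeaders, datasource: 'internal_docs' })`, where `staleIds` is the caller's own list.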

Error Handling

| Error | Cause | Solution |
|---|---|---|
| uploadId already used | Reusing bulk upload ID | Generate unique uploadId per run |
| document too large | Content exceeds limit | Truncate body to ~100KB |
| invalid permissions | Malformed user/group | Use valid email addresses |
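
The first two errors can often be avoided before a request is sent. The sketch below shows one way to do that in Node.js: a collision-resistant upload ID and a byte-aware truncation of the document body. Both function names are hypothetical, and the 100 KB default only mirrors the approximate figure in the table; check the actual limit for your deployment.

```javascript
const { randomUUID } = require('crypto');

// Returns an upload ID that will not repeat across runs, even when two runs
// start within the same millisecond.
function newUploadId(prefix = 'upload') {
  return `${prefix}-${randomUUID()}`;
}

// Truncates `text` so that its UTF-8 encoding is at most `maxBytes` long,
// without cutting a multi-byte character in half.
function truncateUtf8(text, maxBytes = 100 * 1024) {
  const encoder = new TextEncoder();
  let bytes = 0;
  let out = '';
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const size = encoder.encode(ch).length;
    if (bytes + size > maxBytes) break;
    bytes += size;
    out += ch;
  }
  return out;
}
```

In the bulk upload, `uploadId` would then come from `newUploadId()` and the body from `truncateUtf8(doc.content)`. Counting bytes rather than characters matters because non-ASCII text can take two to four bytes per character.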

Next Steps

For common errors, see glean-common-errors.

Info
Category Development
Name glean-core-workflow-b
Version v20260423
Size 3.52KB
Updated At 2026-04-26