Glean Enterprise Search Architecture Integration

v20260423

glean-reference-architecture

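This document provides a reference architecture for enterprise search. It explains how to connect multiple scattered internal knowledge sources, such as Confluence, Jira, and Notion, to the Glean search platform. It describes the full lifecycle, from source connectors, message queue, and permission synchronization through to search presentation, with the goal of unified, reliable, low-latency search over enterprise data.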

Tags: enterprise search, architecture design, system integration, Glean, knowledge management, SaaS, indexing, cloud platform


Glean Reference Architecture

Overview

Enterprise search integration architecture for connecting internal knowledge systems to Glean's indexing and search platform. Designed for organizations needing unified search across Confluence, Google Drive, Notion, Slack, Jira, and custom internal tools. Key design drivers: connector reliability for continuous indexing, permission synchronization to enforce source-system ACLs, incremental vs bulk indexing tradeoffs, and low-latency search aggregation across heterogeneous document types.

Architecture Diagram

Source Systems ──→ Connector Framework ──→ Queue (SQS) ──→ Glean Indexing API
(Confluence, Drive,    (Cloud Run)              ↓            /indexing/documents
 Notion, Slack, Jira)       ↓              Permission Sync  /indexing/permissions
                      Schedule (cron) ──→  Bulk Reindexer   /indexing/datasources
                            ↓
                      Glean Search Index ──→ Client API ──→ Your Apps
                                              /search       (Slack bot, portal)
                                              /chat         (internal tools)

Service Layer

class ConnectorService {
  constructor(private glean: GleanIndexingClient, private cache: CacheLayer) {}

  async indexDocument(doc: SourceDocument): Promise<void> {
    const gleanDoc = this.transformToGleanFormat(doc);
    await this.glean.indexDocument(doc.datasource, gleanDoc);
    await this.syncPermissions(doc.id, doc.acl);
  }

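  async bulkReindex(datasource: string, since?: string): Promise<IndexReport> {
    const docs = await this.fetchAllDocuments(datasource, since);
    const batches = this.chunk(docs, 100);  // Glean recommends batches of 100
    let indexed = 0;
    const failures: string[] = [];  // IDs of documents whose batch failed to index
    for (const batch of batches) {
      try {
        // Transform each source document first, as indexDocument does for single documents
        await this.glean.bulkIndex(datasource, batch.map((d) => this.transformToGleanFormat(d)));
        indexed += batch.length;
      } catch (err) {
        // Record the failed batch and continue, so one bad batch does not abort the whole run
        failures.push(...batch.map((d) => d.id));
      }
    }
    return { datasource, totalIndexed: indexed, failures, timestamp: new Date().toISOString() };
  }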
}
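
The service above calls `chunk` and `transformToGleanFormat` without defining them. A minimal sketch of both follows. The `GleanDocument` type and its field names are illustrative placeholders rather than a confirmed Glean schema, so they should be checked against the Indexing API reference before use.

```typescript
// Shapes repeated here so the sketch is self-contained.
interface Permission { type: 'user' | 'group' | 'domain'; value: string; access: 'read' | 'write'; }
interface SourceDocument {
  id: string; datasource: string; title: string; body: string;
  url: string; author: string; updatedAt: string; acl: Permission[];
}

// Hypothetical target shape; field names are assumptions for illustration.
interface GleanDocument {
  id: string;
  datasource: string;
  title: string;
  viewURL: string;
  body: { mimeType: string; textContent: string };
  updatedAt: number; // epoch seconds
}

// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Map a source-system document onto the (assumed) indexing payload.
function transformToGleanFormat(doc: SourceDocument): GleanDocument {
  return {
    id: doc.id,
    datasource: doc.datasource,
    title: doc.title.trim(),
    viewURL: doc.url,
    body: { mimeType: 'text/plain', textContent: doc.body },
    updatedAt: Math.floor(new Date(doc.updatedAt).getTime() / 1000),
  };
}
```

Keeping the transform a pure function makes it easy to unit-test each datasource's mapping without touching the network.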

Caching Strategy

// All TTL values are in seconds.
const CACHE_CONFIG = {
  searchResults:  { ttl: 30,   prefix: 'search' },   // 30s: freshness is critical for search
  permissions:    { ttl: 300,  prefix: 'perm' },      // 5 min: ACL changes are infrequent, but a revocation can lag by up to the TTL
  datasources:    { ttl: 3600, prefix: 'ds' },        // 1 hr: datasource config rarely changes
  connectorState: { ttl: 60,   prefix: 'conn' },      // 1 min: sync cursor freshness
  documentMeta:   { ttl: 120,  prefix: 'docmeta' },   // 2 min: title/author for search previews
};
// Webhook-driven invalidation: source system change events flush the document cache immediately
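
The configuration above only names TTLs and key prefixes. As a rough illustration of how one entry could be applied, here is a minimal in-memory TTL cache with explicit invalidation. The `TtlCache` class and its injectable clock are hypothetical and not part of the original design; a real deployment would more likely sit on a shared store such as Redis.

```typescript
// Returns milliseconds since epoch; injectable so expiry can be tested deterministically.
type Clock = () => number;

class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private prefix: string,
    private ttlSeconds: number,
    private now: Clock = () => Date.now(),
  ) {}

  private key(id: string): string {
    return `${this.prefix}:${id}`;
  }

  set(id: string, value: V): void {
    this.entries.set(this.key(id), {
      value,
      expiresAt: this.now() + this.ttlSeconds * 1000,
    });
  }

  // Returns undefined for missing or expired entries; expired entries are evicted lazily.
  get(id: string): V | undefined {
    const k = this.key(id);
    const entry = this.entries.get(k);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(k);
      return undefined;
    }
    return entry.value;
  }

  // Intended to be called from a webhook handler when the source system reports a change.
  invalidate(id: string): boolean {
    return this.entries.delete(this.key(id));
  }
}
```

For example, `new TtlCache<string>('docmeta', 120)` would mirror the `documentMeta` entry, and a webhook handler would call `invalidate(docId)` when a document changes.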

Event Pipeline

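import Bull from 'bull';

class IndexingPipeline {
  // Note: the architecture diagram shows SQS; this sketch uses Bull, which is backed by Redis.
  private queue = new Bull('glean-indexing', { redis: process.env.REDIS_URL });

  constructor(private glean: GleanIndexingClient, private connector: ConnectorService) {}

  // Enqueue a change event; a failed job is attempted up to 5 times with exponential backoff (3s base delay).
  async onSourceChange(event: SourceChangeEvent): Promise<void> {
    await this.queue.add(event.type, event, { attempts: 5, backoff: { type: 'exponential', delay: 3000 } });
  }

  // Deletions are removed from the index; any other change re-fetches and re-indexes the document.
  // fetchDoc is assumed to be implemented per source system.
  async processDocumentChange(event: DocumentChangeEvent): Promise<void> {
    if (event.action === 'deleted') await this.glean.deleteDocument(event.datasource, event.docId);
    else await this.connector.indexDocument(await this.fetchDoc(event.datasource, event.docId));
  }

  // Push ACL-only changes without re-indexing the document body.
  async processPermissionChange(event: PermissionChangeEvent): Promise<void> {
    await this.glean.syncPermissions(event.datasource, event.docId, event.newAcl);
  }
}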
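
The pipeline above enqueues events but does not show how a consumer decides which processor to run. One way to sketch that step is a pure function that turns an event into a planned action, which a queue worker can then execute. The event shapes and the `IndexAction` type below are assumptions made for this example; the original does not define them.

```typescript
// Hypothetical event shapes, discriminated by `type`.
interface DocumentChangeEvent {
  type: 'document';
  action: 'created' | 'updated' | 'deleted';
  datasource: string;
  docId: string;
}
interface PermissionChangeEvent {
  type: 'permission';
  datasource: string;
  docId: string;
  newAcl: { type: 'user' | 'group' | 'domain'; value: string; access: 'read' | 'write' }[];
}
type SourceChangeEvent = DocumentChangeEvent | PermissionChangeEvent;

// What the worker should do for a given event.
type IndexAction =
  | { op: 'delete'; datasource: string; docId: string }
  | { op: 'upsert'; datasource: string; docId: string }
  | { op: 'syncAcl'; datasource: string; docId: string; aclSize: number };

// Decide the action without performing any I/O, so the routing logic is easy to test.
function planAction(event: SourceChangeEvent): IndexAction {
  const { datasource, docId } = event;
  switch (event.type) {
    case 'document':
      return event.action === 'deleted'
        ? { op: 'delete', datasource, docId }
        : { op: 'upsert', datasource, docId };
    case 'permission':
      return { op: 'syncAcl', datasource, docId, aclSize: event.newAcl.length };
  }
}
```

Separating the decision from the side effect keeps retries simple: the worker can re-run the same planned action after a failure without re-deriving it.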

Data Model

interface SourceDocument  { id: string; datasource: string; title: string; body: string; url: string; author: string; updatedAt: string; acl: Permission[]; }
interface Permission      { type: 'user' | 'group' | 'domain'; value: string; access: 'read' | 'write'; }
interface ConnectorState  { datasource: string; lastSyncCursor: string; lastFullReindex: string; documentCount: number; status: 'healthy' | 'degraded' | 'failed'; }
interface IndexReport     { datasource: string; totalIndexed: number; failures: string[]; timestamp: string; }
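
To make the `Permission` model concrete, the sketch below evaluates whether a user may read a document given its ACL. The `Principal` shape is hypothetical, and two rules are assumptions rather than anything the source states: `write` access implies `read`, and an empty ACL denies access.

```typescript
interface Permission { type: 'user' | 'group' | 'domain'; value: string; access: 'read' | 'write'; }

// Hypothetical description of the searching user.
interface Principal { email: string; groups: string[]; }

// Does a single ACL entry apply to this principal?
function matches(p: Permission, who: Principal): boolean {
  const email = who.email.toLowerCase();
  switch (p.type) {
    case 'user':
      return p.value.toLowerCase() === email;
    case 'group':
      return who.groups.includes(p.value);
    case 'domain':
      return email.endsWith('@' + p.value.toLowerCase());
  }
}

// Assumption: both 'read' and 'write' grant read access, so any matching entry is enough.
// An empty ACL matches nothing, which gives deny-by-default.
function canRead(acl: Permission[], who: Principal): boolean {
  return acl.some((p) => matches(p, who));
}
```

A reconciliation job could run the same check against both the source system's ACL and the indexed ACL to spot the mismatches mentioned under Error Handling.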

Scaling Considerations

Deploy one connector instance per datasource to isolate failures and rate limits
Schedule bulk reindexing during off-peak hours, because the Glean indexing API has per-datasource throughput limits
Use incremental sync (change cursors) for high-frequency sources (Slack, Jira) to minimize API calls
Permission sync is typically the bottleneck: batch ACL updates and run them in a separate queue consumer
Monitor connector health per datasource; alert when sync lag exceeds 15 minutes for critical sources
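
The sync-lag alert in the last point can be sketched as a small pure check. The `SyncSample` shape, including `lastSuccessfulSyncAt` and the `critical` flag, is an assumption for illustration, since `ConnectorState` in the data model stores a cursor rather than a timestamp.

```typescript
// Hypothetical per-datasource health sample.
interface SyncSample {
  datasource: string;
  lastSuccessfulSyncAt: string; // ISO 8601 timestamp
  critical: boolean;
}

const LAG_THRESHOLD_MS = 15 * 60 * 1000; // 15 minutes, per the guidance above

// Time elapsed since the last successful sync, clamped at zero to tolerate clock skew.
function syncLagMs(sample: SyncSample, now: Date): number {
  return Math.max(0, now.getTime() - new Date(sample.lastSuccessfulSyncAt).getTime());
}

// Return the critical datasources whose lag exceeds the threshold.
function lagAlerts(samples: SyncSample[], now: Date, thresholdMs: number = LAG_THRESHOLD_MS): string[] {
  return samples
    .filter((s) => s.critical && syncLagMs(s, now) > thresholdMs)
    .map((s) => s.datasource);
}
```

Running this on a schedule per datasource keeps the alerting logic independent of any one connector's implementation.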

Error Handling

Component	Failure Mode	Recovery
Connector sync	Source API rate limit	Per-datasource backoff, degrade to hourly bulk sync
Document indexing	Glean 429 throughput limit	Queue retry with jitter, batch size reduction
Permission sync	ACL mismatch between source and Glean	Reconciliation job flags discrepancies for admin review
Bulk reindex	Timeout on large datasource	Checkpoint cursor, resume from last successful batch
Search aggregation	Stale index for one datasource	Degrade gracefully — return results from healthy sources, flag staleness
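
Three of the recovery strategies in the table (retry with jitter, batch size reduction, and resuming from a checkpoint) can be sketched as small pure helpers. The function names, the "full jitter" variant, and the default minimum batch size of 10 are choices made for this example, not something the source prescribes.

```typescript
// Full-jitter exponential backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
// `rand` is injectable so the schedule can be tested deterministically.
function backoffWithJitter(
  attempt: number,
  baseMs: number,
  capMs: number,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

// Halve the batch size after a throttled (429) request, never going below `min`.
function reduceBatchSize(current: number, min: number = 10): number {
  return Math.max(min, Math.floor(current / 2));
}

// Resume a bulk reindex: skip every batch up to and including the last one that succeeded.
// Pass null when there is no checkpoint yet.
function batchesToRun<T>(batches: T[][], lastSuccessfulBatch: number | null): T[][] {
  return lastSuccessfulBatch === null ? batches : batches.slice(lastSuccessfulBatch + 1);
}
```

Randomizing the delay spreads retries from many connectors over time, which helps avoid every instance hitting the throughput limit again at the same moment.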

Next Steps

See glean-deploy-integration.

Information

Category: Programming and Development

Name: glean-reference-architecture

Version: v20260423

Size: 5.88KB

Source: jeremylongshore/claude-code-plugins-plus-skills

Updated: 2026-04-28