Glean Data Indexing and Governance

v20260423
glean-data-handling
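This skill focuses on secure ingestion and index management for enterprise data. It imports documents from connectors such as Google Drive and Confluence. Its core functions are filtering sensitive personally identifiable information (PII) before indexing, strictly maintaining permission boundaries, and enforcing data retention policies, to ensure compliance with global data privacy regulations such as GDPR and CCPA.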

Glean Data Handling

Overview

Glean enterprise search ingests documents from dozens of connectors (Google Drive, Confluence, Slack, Jira, Salesforce, etc.) and builds a unified search index with permission-aware access control. Data types include indexed document content, connector metadata, user permission maps, query logs, and search analytics. All document content must be PII-filtered before indexing, permission boundaries must be preserved to prevent data leakage across teams, and retention policies must be enforced to comply with corporate governance and GDPR/CCPA obligations.

Data Classification

| Data Type | Sensitivity | Retention | Encryption |
| --- | --- | --- | --- |
| Indexed document content | High (may contain PII) | Per source retention policy | AES-256 at rest |
| User permission maps | High (access control) | Sync lifecycle | TLS + at rest |
| Connector metadata | Medium | Until connector removed | AES-256 at rest |
| Search query logs | Medium (reveals intent) | 90 days default | AES-256 at rest |
| Search analytics/aggregates | Low | 1 year | TLS in transit |

Data Import

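// Simplified document shape used throughout this skill; confirm exact field
// names and types against the Glean Indexing API reference before use.
interface GleanDocument {
  id: string;
  datasource: string;
  title: string;
  body: string;
  permissions: { allowedUsers?: string[]; allowAnonymousAccess?: boolean };
  updatedAt: string; // parseable timestamp, e.g. ISO 8601
  url: string;
}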

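import { randomUUID } from 'node:crypto';

async function indexDocuments(docs: GleanDocument[], datasource: string): Promise<void> {
  // Strip PII from every body before anything is sent to the index
  const sanitized = docs.map(doc => ({ ...doc, body: stripPII(doc.body) }));
  // One uploadId ties all pages of a bulk upload together
  const uploadId = randomUUID();
  const batchSize = 100; // max 100 documents per request
  for (let i = 0; i < sanitized.length; i += batchSize) {
    const batch = sanitized.slice(i, i + batchSize);
    // Replace `customer` in the host with your Glean instance name
    const res = await fetch(`https://customer-be.glean.com/api/index/v1/bulkindexdocuments`, {
      method: 'POST',
      headers: { Authorization: `Bearer ${process.env.GLEAN_INDEXING_TOKEN}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({
        uploadId,
        isFirstPage: i === 0,
        isLastPage: i + batchSize >= sanitized.length,
        datasource,
        documents: batch,
      }),
    });
    // fetch resolves on HTTP errors too, so fail loudly instead of skipping a batch
    if (!res.ok) throw new Error(`Bulk index failed at offset ${i}: HTTP ${res.status}`);
  }
}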

function stripPII(text: string): string {
  return text
    .replace(/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, '[EMAIL_REDACTED]')
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '[PHONE_REDACTED]')
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN_REDACTED]');
}
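
Regex-based redaction can miss values, so it may help to scan text after redaction and report which PII categories still match. The sketch below is one way to do that under the assumption that the same three patterns as `stripPII` are in use; `PII_PATTERNS` and `detectPII` are hypothetical names introduced here for illustration, not part of any Glean API.

```typescript
// Hypothetical helper: reports which PII categories still match in a string.
// The patterns mirror the three used by stripPII (email, phone, SSN).
const PII_PATTERNS: { category: string; pattern: RegExp }[] = [
  { category: 'email', pattern: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/ },
  { category: 'phone', pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/ },
  { category: 'ssn', pattern: /\b\d{3}-\d{2}-\d{4}\b/ },
];

function detectPII(text: string): string[] {
  // The patterns have no `g` flag, so RegExp.test keeps no state between calls.
  return PII_PATTERNS
    .filter(entry => entry.pattern.test(text))
    .map(entry => entry.category);
}
```

For example, `detectPII('Contact jane@example.com or 555-123-4567')` returns `['email', 'phone']`, while already redacted text such as `'[EMAIL_REDACTED] called'` returns an empty array. A non-empty result after `stripPII` has run would suggest a pattern that the redaction step does not yet cover.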

Data Export

async function exportSearchAnalytics(startDate: string, endDate: string) {
  const res = await fetch(`https://customer-be.glean.com/api/v1/analytics`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.GLEAN_API_TOKEN}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ startDate, endDate, metrics: ['query_count', 'click_through', 'zero_results'] }),
  });
  // fetch resolves on HTTP errors too, so check the status before parsing
  if (!res.ok) throw new Error(`Analytics export failed: HTTP ${res.status}`);
  const data = await res.json();
  // Drop the userId key entirely and mask queries that are missing or 3 characters or shorter
  return data.results.map(({ userId, ...rest }: any) => ({
    ...rest,
    query: typeof rest.query === 'string' && rest.query.length > 3 ? rest.query : '[SHORT_QUERY_REDACTED]',
  }));
}

Data Validation

function validateDocument(doc: GleanDocument): string[] {
  const errors: string[] = [];
  if (!doc.id || doc.id.length > 512) errors.push('Invalid document ID');
  if (!doc.datasource) errors.push('Missing datasource identifier');
  if (!doc.title || doc.title.length > 1000) errors.push('Title missing or exceeds 1000 chars');
  if (!doc.body) errors.push('Empty document body');
  if (!doc.permissions) errors.push('Missing permissions (defaults to deny-all)');
  if (doc.updatedAt && isNaN(Date.parse(doc.updatedAt))) errors.push('Invalid updatedAt timestamp');
  return errors;
}
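
Because `validateDocument` returns a list of error messages rather than throwing, one natural use is to split a batch into documents that are safe to index and documents that need attention. The following is a minimal sketch of that gating step; `Validator`, `Partition`, and `partitionByValidity` are hypothetical names, and the helper is written generically so it accepts any validator with the same shape.

```typescript
// Hypothetical helper: splits items into those that pass a validator and those
// that fail, keeping the error messages for each rejected item.
type Validator<T> = (item: T) => string[];

interface Partition<T> {
  valid: T[];
  rejected: { item: T; errors: string[] }[];
}

function partitionByValidity<T>(items: T[], validate: Validator<T>): Partition<T> {
  const result: Partition<T> = { valid: [], rejected: [] };
  for (const item of items) {
    const errors = validate(item);
    // An empty error list means the item passed every check.
    if (errors.length === 0) result.valid.push(item);
    else result.rejected.push({ item, errors });
  }
  return result;
}
```

Called as `partitionByValidity(docs, validateDocument)`, the `valid` array could then be passed to `indexDocuments`, while `rejected` could be logged for follow-up instead of silently dropped.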

Compliance

  • PII stripped from document body before indexing (emails, phones, SSNs)
  • Permission boundaries enforced: allowedUsers scope matches source system ACLs
  • Connector credentials stored in secret manager, rotated quarterly
  • Search query logs retained max 90 days, purged via automated job
  • GDPR right-to-erasure: delete all indexed content referencing a specific user on request
  • CCPA: honor do-not-sell signals for search analytics data
  • SOC 2 Type II audit trail for all indexing and deletion operations
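
The 90-day query log limit in the list above implies a purge job that selects entries older than a cutoff. The sketch below shows only that selection step, assuming a hypothetical `QueryLogEntry` shape with an ISO 8601 `loggedAt` field; the storage layer and the actual deletion call are out of scope and would depend on where the logs live.

```typescript
// Hypothetical log record shape; the field names are assumptions for illustration.
interface QueryLogEntry {
  id: string;
  loggedAt: string; // ISO 8601 timestamp
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the entries older than the retention window, i.e. the ones a purge
// job would delete. Entries with an unparseable timestamp are not selected
// (NaN compares false), so they would need separate handling.
function selectExpiredLogs(
  logs: QueryLogEntry[],
  now: Date,
  retentionDays: number = 90,
): QueryLogEntry[] {
  const cutoff = now.getTime() - retentionDays * MS_PER_DAY;
  return logs.filter(entry => Date.parse(entry.loggedAt) < cutoff);
}
```

Passing `now` as a parameter rather than reading the clock inside the function keeps the selection deterministic, which makes the retention rule straightforward to test and audit.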

Error Handling

| Issue | Cause | Fix |
| --- | --- | --- |
| 403 on bulk index | Expired or insufficient indexing token | Rotate token, verify datasource permissions |
| Permission mismatch in search | Stale ACL sync from connector | Force re-sync connector permissions via admin API |
| PII detected in indexed content | New PII pattern not in strip regex | Add pattern to stripPII, re-index affected datasource |
| Zero-result queries spike | Connector sync failure, stale index | Check connector health dashboard, trigger manual re-crawl |
| Rate limit 429 on indexing | Batch size too large or too frequent | Reduce batch to 50 docs, add 500ms delay between batches |
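
The 429 remedy (50-document batches with a 500ms pause) can be sketched as a small batching loop with retry. This is an illustration under stated assumptions, not Glean's documented client behavior: `BatchSender` and `sendInBatches` are hypothetical names, the sender is injected so the loop stays independent of any HTTP client, and the exponential backoff on repeated 429 responses is an added design choice beyond the fixed delay named above.

```typescript
// Hypothetical sender type: posts one batch and resolves to the HTTP status code.
type BatchSender<T> = (batch: T[]) => Promise<number>;

const defaultSleep = (ms: number) =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

async function sendInBatches<T>(
  items: T[],
  send: BatchSender<T>,
  opts: {
    batchSize?: number;
    delayMs?: number;
    maxRetries?: number;
    sleep?: (ms: number) => Promise<void>;
  } = {},
): Promise<void> {
  const { batchSize = 50, delayMs = 500, maxRetries = 3, sleep = defaultSleep } = opts;
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    for (let attempt = 0; ; attempt++) {
      const status = await send(batch);
      if (status !== 429) {
        // Any other error status is not retried here.
        if (status >= 400) throw new Error(`Batch at offset ${i} failed with status ${status}`);
        break;
      }
      if (attempt >= maxRetries) {
        throw new Error(`Batch at offset ${i} still rate limited after ${maxRetries} retries`);
      }
      // Exponential backoff on 429: delayMs, then 2x, then 4x, ...
      await sleep(delayMs * 2 ** attempt);
    }
    // Fixed pause between batches, skipped after the last one.
    if (i + batchSize < items.length) await sleep(delayMs);
  }
}
```

In practice the `send` argument would wrap the bulk index request and return `res.status`. Injecting `sleep` lets tests run without real delays.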

Resources

Next Steps

See glean-security-basics.

Information
Category Product & Business
Size 5.32KB
Updated 2026-04-28