Kernel: Ethernet Drivers

From TechPubs Wiki

Revision as of 07:04, 6 February 2026 by Raion (talk | contribs) (Initial Commit)

This document describes the architecture and common patterns for Ethernet network drivers in IRIX based on analysis of three production drivers: Alteon Tigon (high-end), TI ThunderLAN (mid-range O2), and IOC3 (integrated ASIC). Understanding these patterns is essential for porting modern Ethernet controllers to IRIX.

Overview

IRIX Ethernet drivers follow a consistent architecture across all hardware:

  • PCI infrastructure integration via pciio_* APIs
  • Hardware graph (hwgraph) registration for device discovery
  • Etherif framework providing standard network interface
  • Descriptor ring management for DMA transfers
  • Interrupt handling at splimp() level
  • MII/PHY management for link negotiation
  • Packet scheduling (RSVP) support

Driver Initialization Flow

Module Loading

All drivers follow this initialization sequence:

  1. if_XXinit() - Called once at boot by edtinit()
    • Registers PCI vendor/device IDs via pciio_driver_register()
    • Registers idbg debug functions
  2. if_XXattach() - Called for each matching PCI device
    • Creates hwgraph vertices
    • Maps PCI memory regions
    • Allocates descriptor rings
    • Initializes hardware
    • Registers interrupt handlers
  3. if_XXopen() - Called when interface is configured
    • Gets unit number from device_controller_num_get()
    • Calls ether_attach() to register with network stack

Critical Initialization Order

1. pciio_driver_register() - Register with PCI subsystem
2. Allocate private state structure (ei/ti/ei_info)
3. hwgraph_char_device_add() - Create device vertex
4. pciio_piotrans_addr() - Map register spaces
5. Allocate DMA-able memory via contig_memalloc()
6. Initialize hardware (reset, PHY, MAC)
7. pciio_intr_alloc() + pciio_intr_connect()
8. ether_attach() - Final registration

Memory Management

DMA Memory Allocation

IRIX requires physically contiguous memory for DMA descriptor rings. Use contig_memalloc() with specific alignment requirements:

Descriptor Ring Requirements:

  • Alteon TX ring: 64K alignment for 512-entry ring
  • IOC3 TX ring: 64K alignment for 512-entry ring
  • IOC3 RX ring: 4K alignment
  • ThunderLAN: Page-aligned

Example (IOC3): <source lang="c">
size = NTXD * TXDSZ;                      // 512 * 128 bytes
npgs = (size + NBPP - 1) / NBPP;
pgno = contig_memalloc(npgs, ALIGNMENT_PAGES, VM_DIRECT);
ei->txd = (struct eftxd *)small_pfntova_K0(pgno);
</source>
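The page-count arithmetic in the example can be checked in isolation. This standalone sketch assumes a 16K page size for NBPP, which matches many IRIX configurations but is an assumption here:

```c
#include <assert.h>
#include <stddef.h>

/* Round a descriptor-ring size up to whole pages, as done before
 * calling contig_memalloc(). NBPP (bytes per page) = 16K is an
 * assumption for this sketch. */
#define NBPP   16384
#define NTXD   512
#define TXDSZ  128

static size_t ring_pages(size_t size)
{
    return (size + NBPP - 1) / NBPP;
}
```

The same rounding idiom applies to any DMA-able allocation that must be requested in whole pages.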

Cache Coherency - CRITICAL

IRIX systems (especially Origin/Octane with Heart chipset) require explicit cache management:

Heart Coherency WAR: <source lang="c">
#if HEART_COHERENCY_WAR
#define CACHE_WB(addr, len)    heart_dcache_wb_inval((addr), (len))
#else
#define CACHE_WB(addr, len)
#endif

#if HEART_INVALIDATE_WAR
#define CACHE_INVAL(addr, len) heart_invalidate_war((addr), (len))
#else
#define CACHE_INVAL(addr, len)
#endif
</source>

When to flush caches:

  • Before DMA read (RX): CACHE_INVAL() on receive buffers
  • Before DMA write (TX): CACHE_WB() on transmit data
  • After descriptor update: CACHE_WB() on descriptor memory
  • IP32 (O2) specific: Always use __vm_dcache_inval() and dki_dcache_wb()

PCI Infrastructure Integration

Address Translation

IRIX requires explicit DMA address translation for 64-bit capable devices:

Command DMA (descriptors, control): <source lang="c">
#define KVTOIOADDR_CMD(ei, addr) \
    pciio_dmatrans_addr((ei)->conn_vhdl, 0, \
        kvtophys((caddr_t)(addr)), sizeof(int), \
        PCIIO_DMA_A64 | PCIIO_DMA_CMD)
</source>

Data DMA (packet buffers): <source lang="c">
// Fast path using pre-allocated DMA map
#define KVTOIOADDR_DATA(ei, addr) \
    pciio_dmamap_addr((ei)->fastmap, \
        kvtophys((caddr_t)(addr)), sizeof(int))

// Allocate fast map during init:
ei->fastmap = pciio_dmamap_alloc(conn_vhdl, dev_desc,
    sizeof(int), PCIIO_DMA_DATA | PCIIO_DMA_A64 | PCIIO_FIXED);
</source>

Interrupt Registration

<source lang="c">
// Set interrupt name and priority
device_desc_intr_name_set(dev_desc, "Ethernet");
device_desc_intr_swlevel_set(dev_desc, (ilvl_t)splimp);

// Allocate and connect interrupt
ei->intr = pciio_intr_alloc(conn_vhdl, dev_desc,
                            PCIIO_INTR_LINE_A, enet_vhdl);
pciio_intr_connect(ei->intr, (intr_func_t)if_XXintr,
                   (intr_arg_t)ei, NULL);
</source>

Descriptor Ring Architecture

All three drivers use producer/consumer ring buffers with hardware/software indices:

Common Ring Structure

<source lang="c">
struct driver_info {
    // TX ring
    struct txd *txd;           // TX descriptor array
    struct mbuf **txm;         // Pending TX mbufs
    int txhead;                // SW produce index
    int txtail;                // SW consume index

    // RX ring
    struct rxd *rxd;           // RX descriptor array
    struct mbuf **rxm;         // Posted RX mbufs
    int rxhead;                // SW consume index
    int rxtail;                // SW produce index
};
</source>

Ring Index Management

Critical macro patterns: <source lang="c">
#define NEXTTXD(i)     (((i) + 1) & (NTXD - 1))   // Ring size must be power of 2
#define DELTATXD(h, t) (((h) - (t)) & (NTXD - 1)) // In-use descriptors; free = NTXD - delta
</source>
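The mask-based ring arithmetic can be exercised entirely in user space. This standalone sketch (ring size chosen to match the IOC3 example) shows index wraparound and the free-descriptor count used later by the RSVP callbacks:

```c
#include <assert.h>

/* Power-of-2 ring arithmetic as used by the IRIX drivers.
 * NTXD must be a power of two so that masking replaces modulo. */
#define NTXD           512
#define NEXTTXD(i)     (((i) + 1) & (NTXD - 1))
#define DELTATXD(h, t) (((h) - (t)) & (NTXD - 1))
```

Because the subtraction is masked, the delta is correct even after the head index has wrapped past zero while the tail has not.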

IOC3-specific gotcha: Hardware compares RX indices modulo-16, so allocate 15 extra buffers: <source lang="c">
#define NRBUFADJ(n) ((n) + 15)  // Required for IOC3 only
</source>
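The reason for the adjustment can be demonstrated directly: two indices that differ by any multiple of 16 are indistinguishable to hardware that only compares the low four bits, so without slack the ring can look empty (or full) when it is not:

```c
#include <assert.h>

/* IOC3 compares RX produce/consume indices modulo 16; indices that
 * differ by a multiple of 16 look equal to the hardware. Posting
 * NRBUFADJ(n) buffers keeps the comparison unambiguous. */
#define NRBUFADJ(n) ((n) + 15)

static int hw_indices_equal(int produce, int consume)
{
    return (produce & 15) == (consume & 15);
}
```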

Transmit Path

Etherif Transmit Function

The transmit function must handle:

  1. Mbuf chain validation and coalescing
  2. Ethernet header insertion
  3. DMA address setup
  4. Descriptor programming
  5. Hardware checksum offload (if supported)
  6. Self-snoop for packet capture

Typical signature: <source lang="c">
static int XX_transmit(
    struct etherif *eif,
    struct etheraddr *edst,   // Destination MAC
    struct etheraddr *esrc,   // Source MAC
    u_short type,             // EtherType
    struct mbuf *m0);         // Packet chain
</source>

Mbuf Alignment Requirements

Different hardware has different alignment constraints:

IOC3 requirements:

  • Data0 (in descriptor): must be even length if buf1 is used
  • Buf1 pointer: must be 2-byte aligned
  • Buf2 pointer: must be 128-byte aligned AND buf1 must be even length
  • No buffer may cross page boundary

Solution patterns: <source lang="c">
// Check alignment
#define MALIGN2(m)   ALIGNED(mtod((m), u_long), 2)
#define MALIGN128(m) ALIGNED(mtod((m), u_long), 128)

// Copy if misaligned
if (!MALIGN2(m0) || mbuf_crosses_page(m0))
    goto copyall;

copyall:
    m = m_vget(M_DONTWAIT, mlen, MT_DATA);
    m_datacopy(m0, 0, mlen, mtod(m, caddr_t));
    m_freem(m0);
    m0 = m;
</source>
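The mbuf_crosses_page() test referenced above is not shown in the snippet. One plausible implementation compares the page of the first and last byte of the buffer; the 4K page size here is purely illustrative (NBPP varies by platform):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A buffer crosses a page boundary when its first and last bytes fall
 * in different pages. NBPP = 4096 is an assumption for this sketch. */
#define NBPP 4096

static int buf_crosses_page(uintptr_t addr, size_t len)
{
    return (addr / NBPP) != ((addr + len - 1) / NBPP);
}
```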

Hardware Checksum Offload

IOC3 TX checksum: <source lang="c">
if (m0->m_flags & M_CKSUMMED) {
    struct ip *ip = mtod(m0, struct ip *);
    int hlen = ip->ip_hl << 2;

    // Compute pseudo-header checksum
    __uint32_t cksum = ((ip->ip_len - hlen)
        + htons((ushort)ip->ip_p)
        + (ip->ip_src.s_addr >> 16)
        + (ip->ip_src.s_addr & 0xffff)
        + (ip->ip_dst.s_addr >> 16)
        + (ip->ip_dst.s_addr & 0xffff));

    // Subtract ethernet header contribution
    cksum += 0xffff ^ (type + MAC_checksum_contrib);

    // Store adjustment in TCP/UDP header
    mtod(m0, u_short *)[sumoff / 2] = fold_checksum(cksum);

    // Set descriptor flags
    txd->cmd |= (sumoff << ETXD_CHKOFF_SHIFT) | ETXD_DOCHECKSUM;
}
</source>
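The fold_checksum() helper used above is never defined in the snippet. A conventional fold of a 32-bit one's-complement accumulator into 16 bits looks like this; treat it as a sketch, since the driver's actual helper may differ (particularly in light of the rev 0 complement errata noted below):

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement sum down to 16 bits by repeatedly
 * adding the carries back in. */
static uint16_t fold_checksum(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum >> 16) + (sum & 0xffff);
    return (uint16_t)sum;
}
```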

Errata: IOC3 rev 0 has a bug - it fails to take one's complement. Disable TX checksum for rev 0 chips.

Descriptor Update Pattern

<source lang="c">
// Fill descriptor fields
txd->cmd = flags | length;
txd->bufcnt = total_length;
txd->p1 = KVTOIOADDR_DATA(ei, mbuf_data);

// CRITICAL: Writeback cache before hardware sees it
CACHE_WB((void *)txd, TXDSZ);

// Save mbuf for later freeing
ei->txm[ei->txhead] = m0;

// Update software index
ei->txhead = NEXTTXD(ei->txhead);

// Notify hardware (write produce register)
W_REG(ei->regs->etpir, ei->txhead * TXDSZ);
</source>

Receive Path

RX Buffer Posting

Drivers must maintain a pool of posted receive buffers:

<source lang="c">
static void ef_fill(struct ef_info *ei)
{
    int i, head = ei->rxhead;
    int tail = ei->rxtail;
    int n = ei->nrbuf - DELTARXD(head, tail);
    struct mbuf *m;
    struct rxbuf *rb;

    for (i = 0; i < n; i++) {
        m = m_vget(M_DONTWAIT, sizeof(struct rxbuf), MT_DATA);
        if (!m)
            break;

        rb = mtod(m, struct rxbuf *);
        ei->rxm[tail] = m;
        ei->rxd[tail] = KVTOIOADDR_DATA(ei, rb);

        // Invalidate for DMA
        CACHE_INVAL(rb, sizeof(struct rxbuf));
        CACHE_WB(mtod(m, caddr_t), sizeof(struct rxbuf));

        tail = NEXTRXD(tail);
    }

    ei->rxtail = tail;
    W_REG(ei->regs->erpir, ERPIR_ARM | (tail * RXDSZ));
}
</source>

Interrupt Handler

IRIX interrupt handlers run at splimp() priority and must:

  1. Read/clear interrupt status
  2. Process received packets
  3. Reclaim completed TX descriptors
  4. Rearm interrupt

Typical pattern: <source lang="c">
static void if_XXintr(struct XX_info *ei)
{
    int s, isr;
    struct mbuf *mact = NULL, *m;

    s = mutex_bitlock(&ei->flags, EIF_LOCK);

    // Read and clear interrupt status
    isr = R_REG(ei->regs->eisr);
    W_REG(ei->regs->eisr, isr);

    // Process RX
    if (isr & EISR_RXTHRESH)
        mact = ef_recv(ei);

    // Reclaim TX
    if (isr & EISR_TXDONE)
        ef_reclaim(ei);

    // Handle errors
    if (isr & EISR_ERROR)
        handle_errors(ei);

    // Rearm
    W_REG(ei->regs->erpir, ERPIR_ARM | produce_index);

    mutex_bitunlock(&ei->flags, EIF_LOCK, s);

    // Send packets up AFTER releasing lock
    while (mact) {
        m = mact;
        mact = mact->m_act;
        ether_input(&ei->eif, 0, m);
    }
}
</source>

RX Checksum Validation

<source lang="c">
if ((ntohs(eh->ether_type) == ETHERTYPE_IP)
    && (rlen >= 60)
    && ((ip->ip_off & (IP_OFFMASK | IP_MF)) == 0)
    && ((ip->ip_p == IPPROTO_TCP) || (ip->ip_p == IPPROTO_UDP))) {

    // IOC3 provides checksum in descriptor
    cksum = rxd->w0 & ERXBUF_IPCKSUM_MASK;

    // Finish calculation (add pseudo-header, subtract ether header)
    // ...

    if (cksum == 0xffff)
        m->m_flags |= M_CKSUMMED;
}
</source>
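The cksum == 0xffff acceptance test relies on a standard property of the Internet checksum: the one's-complement sum of a packet that contains its own complemented checksum field is all ones. A standalone sketch of that property:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over 16-bit words, folded to 16 bits. */
static uint16_t csum16(const uint16_t *p, size_t nwords)
{
    uint32_t sum = 0;
    while (nwords--)
        sum += *p++;
    while (sum >> 16)
        sum = (sum >> 16) + (sum & 0xffff);
    return (uint16_t)sum;
}
```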

PHY Management (MII Interface)

PHY Detection and Reset

<source lang="c">
static int ef_phyprobe(struct ef_info *ei)
{
    int i, r2, r3, val;

    for (i = 0; i < 32; i++) {
        ei->phyunit = i;
        r2 = ef_phyget(ei, MII_PHY_ID_HI);
        r3 = ef_phyget(ei, MII_PHY_ID_LO);
        val = (r2 << 12) | (r3 >> 4);

        switch (val) {
        case PHY_QS6612X:
        case PHY_ICS1889:
        case PHY_DP83840:
            ei->phytype = val;
            ei->phyrev = r3 & 0xf;
            return val;
        }
    }
    return -1;
}
</source>
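The ID-register composition can be factored out for clarity. The bit split below simply mirrors the snippet above (type from the high ID register plus the top of the low register, revision from the low nibble); the register values in the test are hypothetical, chosen only to exercise the arithmetic:

```c
#include <assert.h>

/* PHY identity decoding following the probe snippet:
 * type = (ID_HI << 12) | (ID_LO >> 4), rev = ID_LO & 0xf. */
static int phy_type(int r2, int r3) { return (r2 << 12) | (r3 >> 4); }
static int phy_rev(int r3)          { return r3 & 0xf; }
```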

MII Read/Write

Hardware provides two main approaches:

Approach 1: Dedicated MII registers (IOC3, ThunderLAN): <source lang="c">
static int ef_phyget(struct ef_info *ei, int reg)
{
    // Wait for ready
    while (regs->micr & MICR_BUSY)
        DELAYBUS(1);

    // Trigger read
    regs->micr = MICR_READTRIG |
                 (ei->phyunit << MICR_PHYADDR_SHIFT) | reg;

    // Wait for completion
    while (regs->micr & MICR_BUSY)
        DELAYBUS(1);

    return (regs->midr_r & MIDR_DATA_MASK);
}
</source>

Approach 2: GPIO bit-banging (Alteon, ThunderLAN alt): <source lang="c">
// Send sync pattern (32 cycles of 1)
// Send start bits (01)
// Send opcode (10 = read, 01 = write)
// Send PHY address (5 bits)
// Send register (5 bits)
// Turnaround, then read data (16 bits)
</source>

Auto-negotiation

<source lang="c">
static void ef_autonego(struct ef_info *ei)
{
    int r4, r5;
    int timeout = 20000;

    // Enable auto-negotiation
    ef_phyput(ei, MII_CTRL, MII_CTRL_AUTOEN | MII_CTRL_RESTART);

    // Wait for completion
    while (!(ef_phyget(ei, MII_STATUS) & MII_STATUS_ANDONE)) {
        DELAY(100);
        if (timeout-- <= 0) {
            // Default to 10Mbit half-duplex
            ei->speed100 = 0;
            ei->fullduplex = 0;
            return;
        }
    }

    // Parse negotiation result
    r4 = ef_phyget(ei, MII_AN_ADV);
    r5 = ef_phyget(ei, MII_AN_LPAR);

    if ((r4 & MII_TXFD) && (r5 & MII_TXFD)) {
        ei->speed100 = 1;
        ei->fullduplex = 1;
    } else if ((r4 & MII_TX) && (r5 & MII_TX)) {
        ei->speed100 = 1;
        ei->fullduplex = 0;
    }
    // ... remaining 10Mbit cases
}
</source>
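The if/else ladder implements highest-common-ability resolution: intersect the local advertisement with the link partner's, then pick the best mode present. A compact sketch of that logic (the AN_* bit values below are placeholders, not the real MII register definitions):

```c
#include <assert.h>

/* Placeholder advertisement bits for illustration only. */
#define AN_TXFD  0x0100  /* 100BASE-TX full duplex */
#define AN_TX    0x0080  /* 100BASE-TX half duplex */
#define AN_10FD  0x0040  /* 10BASE-T full duplex  */
#define AN_10    0x0020  /* 10BASE-T half duplex  */

/* Returns (speed100 << 1) | fullduplex */
static int resolve(int adv, int lpar)
{
    int common = adv & lpar;   /* abilities both ends advertised */
    if (common & AN_TXFD) return 0x3;
    if (common & AN_TX)   return 0x2;
    if (common & AN_10FD) return 0x1;
    return 0x0;                /* fall back to 10Mbit half duplex */
}
```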

Watchdog and Error Handling

Periodic Watchdog

<source lang="c">
static void ef_watchdog(struct ifnet *ifp)
{
    struct ef_info *ei = eiftoei(ifptoeif(ifp));

    // Update statistics
    ef_getcdc(ei);  // Collision/defer counters

    // Check PHY status
    ef_phyerr(ei);

    // Reclaim TX descriptors
    ef_reclaim(ei);

    // Refill RX buffers
    ef_fill(ei);

    // Check for missed interrupts
    ef_intr(ei);

    // Reschedule
    ifp->if_timer = WATCHDOG_INTERVAL;
}
</source>

Link State Monitoring

<source lang="c">
static void ef_phyerr(struct ef_info *ei)
{
    int reg1 = ef_phyget(ei, MII_STATUS);

    // Link status (latched low, read twice)
    if (!(reg1 & MII_STATUS_LINK)) {
        if (!(ei->flags & EIF_LINKDOWN)) {
            cmn_err(CE_WARN, "ef%d: link fail - check cable",
                    ei->unit);
            ei->flags |= EIF_LINKDOWN;
        }
    } else {
        if (ei->flags & EIF_LINKDOWN) {
            cmn_err(CE_NOTE, "ef%d: link ok", ei->unit);
            ei->flags &= ~EIF_LINKDOWN;
            ef_reinit(ei);  // May need to adjust speed/duplex
        }
    }

    // Check for remote fault, jabber, etc.
}
</source>

RSVP/Packet Scheduling

IRIX supports packet scheduling for QoS. Drivers must:

  1. Report available TX descriptors
  2. Signal when descriptors become available
  3. Optionally interrupt on each TX completion when scheduling active

<source lang="c">
static int ef_txfree_len(struct ifnet *ifp)
{
    struct ef_info *ei = eiftoei(ifptoeif(ifp));
    return (NTXD - DELTATXD(ei->txhead, ei->txtail));
}

static void ef_setstate(struct ifnet *ifp, int setting)
{
    struct ef_info *ei = eiftoei(ifptoeif(ifp));
    if (setting)
        ei->flags |= EIF_PSENABLED;
    else
        ei->flags &= ~EIF_PSENABLED;
}

// In transmit path:
if ((ei->flags & EIF_PSENABLED) &&
    ((ei->txhead & ei->intfreq) == 0)) {
    txd->cmd |= ETXD_INTWHENDONE;
}

// In interrupt handler:
if (ei->flags & EIF_PSENABLED) {
    ps_txq_stat(eiftoifp(&ei->eif),
                ef_txfree_len(eiftoifp(&ei->eif)));
}

// During init:
struct ps_parms ps_params;
ps_params.bandwidth = ei->speed100 ? 100000000 : 10000000;
ps_params.flags = 0;
ps_params.txfree = NTXD;
ps_params.txfree_func = ef_txfree_len;
ps_params.state_func = ef_setstate;
ps_init(eiftoifp(&ei->eif), &ps_params);
</source>
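With intfreq a power-of-2 mask, the transmit-path test "(txhead & intfreq) == 0" requests a TX-done interrupt on every (intfreq + 1)th descriptor. This sketch counts how many interrupt requests that produces over a full ring:

```c
#include <assert.h>

/* Count descriptors that would carry an interrupt-on-completion flag
 * when the produce index is masked against intfreq. */
static int count_int_requests(int ndesc, int intfreq)
{
    int i, n = 0;
    for (i = 0; i < ndesc; i++)
        if ((i & intfreq) == 0)
            n++;
    return n;
}
```

Tuning intfreq trades interrupt rate against the freshness of the free-descriptor statistics reported to the packet scheduler.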

Locking and Synchronization

IRIX uses bitlocks for driver synchronization:

<source lang="c">
// Initialize lock during attach
init_bitlock(&ei->flags, EIF_LOCK, "ef drv lock", 0);

// Acquire in all paths that touch hardware
s = mutex_bitlock(&ei->flags, EIF_LOCK);
// ... critical section ...
mutex_bitunlock(&ei->flags, EIF_LOCK, s);

// Check if locked (for assertions)
ASSERT(ei->flags & EIF_LOCK);
</source>

Lock hierarchy:

  1. ifnet lock (IFNET_LOCK) - Highest level, held during ioctl
  2. Driver bitlock - Protects hardware access and descriptor rings
  3. Separate TX/RX locks - Some drivers use separate locks for TX and RX paths

Critical sections that need locking:

  • Descriptor ring manipulation
  • Hardware register access
  • mbuf list operations
  • Statistics updates

Operations outside lock:

  • ether_input() - Network stack must be called without driver lock
  • m_freem() - Can be called outside lock

Common Pitfalls and Gotchas

Cache Coherency

Problem: Forgetting cache flushes causes random packet corruption.

Solution: Always CACHE_WB before DMA write, CACHE_INVAL before DMA read.

Descriptor Alignment

Problem: Hardware silently uses wrong memory if alignment is incorrect.

Solution: Use contig_memalloc with correct alignment parameter.

mbuf Chain Fragmentation

Problem: Each mbuf may require a separate DMA descriptor, exhausting hardware resources.

Solution: Coalesce into single mbuf if chain too long or poorly aligned.

Register Write Ordering

Problem: Posted writes may not complete before hardware starts DMA.

Solution: Read back critical registers to force write completion: <source lang="c">
W_REG(ei->regs->emcr, EMCR_RST);
R_REG(ei->regs->emcr);  // Force write to complete
</source>

IOC3 RX Index Comparison

Problem: Hardware compares RX indices modulo-16, causing premature ring full condition.

Solution: Post 15 extra buffers (NRBUFADJ macro).

Interrupt Re-arming

Problem: Forgetting to rearm causes no more interrupts.

Solution: Always write to interrupt rearm register at end of handler.

Link State During Init

Problem: Reading MII status during PHY reset gives false link-down errors.

Solution: Delay link state checking until after auto-negotiation completes.

Multicast Filter Collisions

Problem: Multiple multicast addresses may hash to same filter bit.

Solution: Maintain collision counter; keep filter bit set if any addresses hash to it.
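The collision-counter solution amounts to reference-counting each filter bit: a bit stays set while any subscribed address still hashes to it. A standalone sketch (the 64-entry table and explicit hash parameter are illustrative; real hardware typically derives the bit index from the Ethernet CRC of the address):

```c
#include <assert.h>
#include <string.h>

#define FILTER_BITS 64

struct mfilter {
    unsigned char refcnt[FILTER_BITS];  /* addresses per filter bit */
    unsigned long long bits;            /* one bit per filter slot */
};

static void mf_add(struct mfilter *f, int hash)
{
    if (f->refcnt[hash]++ == 0)         /* first subscriber sets the bit */
        f->bits |= 1ULL << hash;
}

static void mf_del(struct mfilter *f, int hash)
{
    if (--f->refcnt[hash] == 0)         /* last subscriber clears it */
        f->bits &= ~(1ULL << hash);
}
```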

Hardware-Specific Considerations

Alteon Tigon

  • Firmware loading - Must load code/data into on-chip SRAM
  • Event ring - Asynchronous events in addition to interrupts
  • Large SRAM - Can hold many packets on-chip
  • Sophisticated coalescing - Configurable interrupt coalescing

TI ThunderLAN

  • List-based DMA - Multiple fragments per packet
  • Integrated PHY - On-chip 10Mbit PHY plus MII for external PHY
  • EEPROM access - Complex protocol for reading MAC address
  • Dual channels - Separate TX/RX channels with independent DMA

IOC3

  • Multi-function PCI ASIC - Shares the chip with serial/parallel ports
  • Hardware checksumming - TX and RX checksum offload
  • SSRAM buffer - On-chip packet buffer (64KB or 128KB)
  • Parity checking - Optional parity on SSRAM
  • NIC EEPROM - Number-in-a-can for MAC address

Testing and Debugging

Debug Print Macros

<source lang="c">
#define DBG_PRINT(a) if (ei->if_flags & IFF_DEBUG) printf a

// Usage:
DBG_PRINT(("ef%d: txhead=%d txtail=%d\n", unit, txhead, txtail));
</source>

Dump Functions

Every driver should provide idbg dump functions: <source lang="c">
idbg_addfunc("ef_dump", (void (*)())ef_dump);

static void ef_dump(int unit)
{
    ef_dumpif(ifp);    // Interface statistics
    ef_dumpei(ei);     // Driver private state
    ef_dumpregs(ei);   // Hardware registers
    ef_dumpphy(ei);    // PHY registers
}
</source>

Common Debug Points

  • RX ring full - Check if ef_fill() is being called
  • TX ring full - Check if ef_reclaim() is working
  • Checksum failures - Verify pseudo-header calculation
  • Link flapping - Check PHY registers and cable
  • DMA errors - Verify descriptor alignment and cache flushing
  • Interrupt storms - Ensure status register is being cleared

Porting Checklist

When porting a new Ethernet controller to IRIX:

1. PCI Infrastructure

  • [ ] Register vendor/device ID in if_XXinit()
  • [ ] Map register spaces with pciio_piotrans_addr()
  • [ ] Enable bus master and memory space in PCI config
  • [ ] Set cache line size and latency timer

2. Memory Management

  • [ ] Allocate descriptor rings with contig_memalloc()
  • [ ] Verify alignment requirements met
  • [ ] Allocate mbuf tracking arrays with kmem_zalloc()
  • [ ] Create fast DMA map with pciio_dmamap_alloc()

3. Hardware Initialization

  • [ ] Reset chip and verify self-test passes
  • [ ] Probe and initialize PHY
  • [ ] Configure MAC address from EEPROM/NIC
  • [ ] Set up descriptor ring base addresses
  • [ ] Configure DMA parameters (burst size, etc.)
  • [ ] Enable checksumming if supported

4. Descriptor Management

  • [ ] Implement ring index macros (NEXT, DELTA)
  • [ ] Handle cache coherency (CACHE_WB/CACHE_INVAL)
  • [ ] Translate addresses with KVTOIOADDR_CMD/DATA
  • [ ] Post initial RX buffers

5. Interrupt Handling

  • [ ] Register handler with pciio_intr_alloc/connect
  • [ ] Read and clear interrupt status
  • [ ] Process RX and TX completions
  • [ ] Handle error conditions
  • [ ] Rearm interrupts

6. Network Integration

  • [ ] Implement etherif operations vector
  • [ ] Call ether_attach() with correct inventory type
  • [ ] Support multicast filtering
  • [ ] Implement SIOCADDMULTI/SIOCDELMULTI ioctls
  • [ ] Handle IFF_PROMISC and IFF_ALLMULTI

7. Packet Scheduling

  • [ ] Implement txfree_len callback
  • [ ] Implement setstate callback
  • [ ] Call ps_init() during initialization
  • [ ] Optionally interrupt per-packet when PS enabled

8. Testing

  • [ ] Basic ping test (small packets)
  • [ ] Large transfer test (TCP bulk data)
  • [ ] Multicast functionality
  • [ ] Promiscuous mode (tcpdump)
  • [ ] Link up/down handling
  • [ ] Error recovery (cable unplug/replug)
  • [ ] Performance testing (ttcp, netperf)

Summary

IRIX Ethernet drivers follow a well-defined architecture that balances hardware flexibility with software consistency. Key principles:

  • Explicit cache management for DMA coherency
  • Physically contiguous descriptor rings
  • Producer/consumer rings with power-of-2 sizes
  • Bitlocks for synchronization
  • PHY auto-negotiation with error recovery
  • Hardware/software index split for lock-free updates
  • PCI infrastructure integration for portability

Understanding these patterns enables efficient porting of modern Ethernet controllers to the IRIX platform while avoiding common pitfalls around cache coherency, alignment, and interrupt handling.