Linux Kernel Networking
Rami Rosen
ramirose@gmail.com
Haifux, August 2007
Disclaimer
Everything in this lecture shall not, under any
circumstances, hold any legal liability whatsoever.
Any usage of the data and information in this document
shall be solely on the responsibility of the user.
This lecture is not given on behalf of any company
or organization.
Warning
This lecture will deal with design and functional descriptions side by side with many implementation details; some knowledge of C is preferred.
General
The Linux networking kernel code (including network device drivers) is a large part of the Linux kernel code.
Scope: We will not deal with wireless, IPv6, and multicasting.
Nor will we deal with user space routing daemons/apps, or with security attacks (like DoS, spoofing, etc.).
Understanding a packet walkthrough in the kernel is a key to understanding kernel networking. Understanding it is a must if we want to understand Netfilter or IPSec internals, and more.
There is a 10-page Linux kernel networking walkthrough document (link no. 1 in links).
General Contd.
Though it deals with the 2.4.20 Linux kernel, most of it is still relevant.
This lecture will concentrate on this walkthrough (design and implementation details).
References to code in this lecture are based on Linux 2.6.23-rc2.
There was some serious cleanup in 2.6.23.
Hierarchy of networking layers
The layers that we will deal with (based on the 7-layer OSI model) are:
Transport Layer (L4) (udp, tcp...)
Network Layer (L3) (ip)
Link Layer (L2) (ethernet)
Networking Data Structures
The two most important structures of the Linux kernel network layer are:
sk_buff (defined in include/linux/skbuff.h)
net_device (defined in include/linux/netdevice.h)
It is better to know a bit about them before delving into the walkthrough code.
SK_BUFF
sk_buff represents data and headers.
sk_buff API (examples)
sk_buff allocation is done with alloc_skb() or dev_alloc_skb(); drivers use dev_alloc_skb(). (Freeing is done by kfree_skb() and dev_kfree_skb().)
unsigned char *data: points to the current header.
skb_pull(struct sk_buff *skb, unsigned int len) removes data from the start of a buffer by advancing data to data+len and by decreasing skb->len by len.
Almost always, sk_buff instances appear as "skb" in the kernel code.
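To make the data/len bookkeeping concrete, here is a userspace sketch. struct mini_skb, mini_skb_pull() and mini_skb_push() are hypothetical stand-ins, not kernel code; they only mirror the pointer arithmetic of skb_pull() described above and of its counterpart skb_push(), which is used when a header is prepended. Like the real skb_pull(), the pull returns NULL when asked for more than len bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct sk_buff:
 * only the fields needed to show how data and len move. */
struct mini_skb {
    unsigned char *head;   /* start of the allocated buffer */
    unsigned char *data;   /* current header, like skb->data */
    unsigned int len;      /* bytes from data to the end of the packet */
};

/* Modeled on skb_pull(): strip len bytes from the front of the data.
 * Returns the new data pointer, or NULL if fewer than len bytes remain. */
static unsigned char *mini_skb_pull(struct mini_skb *skb, unsigned int len)
{
    if (len > skb->len)
        return NULL;
    skb->len -= len;
    skb->data += len;
    return skb->data;
}

/* Modeled on skb_push(): move data back by len bytes into the headroom,
 * e.g. to prepend a header when sending. */
static unsigned char *mini_skb_push(struct mini_skb *skb, unsigned int len)
{
    assert(skb->data - skb->head >= (ptrdiff_t)len);  /* enough headroom? */
    skb->data -= len;
    skb->len += len;
    return skb->data;
}
```

For example, pulling 14 bytes from a buffer whose data starts at offset 16 leaves data at offset 30 and shrinks len by 14; a failed pull leaves both untouched.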
SK_BUFF contd
sk_buff includes 3 header members (in kernels before 2.6.22 these were 3 unions); each corresponds to a kernel network layer:
transport_header (previously called h) for layer 4, the transport layer (can include a tcp header, udp header, icmp header, and more)
network_header (previously called nh) for layer 3, the network layer (can include an ip header, ipv6 header or arp header).
mac_header (previously called mac) for layer 2, the link layer.
skb_network_header(skb), skb_transport_header(skb) and skb_mac_header(skb) return a pointer to the corresponding header.
SK_BUFF contd.
struct dst_entry *dst: the route for this sk_buff; this route is determined by the routing subsystem.
It has 2 important function pointers:
int (*input)(struct sk_buff *);
int (*output)(struct sk_buff *);
input() can be assigned to one of the following: ip_local_deliver, ip_forward, ip_mr_input, ip_error or dst_discard_in.
output() can be assigned to one of the following: ip_output, ip_mc_output, ip_rt_bug, or dst_discard_out.
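A small userspace sketch of how the two function pointers are used. struct mini_dst, struct mini_skb and the fake_* handlers are hypothetical; the mini_dst_input()/mini_dst_output() wrappers only imitate the idea behind the kernel's dst_input()/dst_output(), which in essence call skb->dst->input(skb) and skb->dst->output(skb).

```c
#include <assert.h>

struct mini_skb;

/* Hypothetical miniature of struct dst_entry: only the two function
 * pointers that the routing subsystem fills in. */
struct mini_dst {
    int (*input)(struct mini_skb *skb);
    int (*output)(struct mini_skb *skb);
};

struct mini_skb {
    struct mini_dst *dst;      /* route attached by the routing lookup */
    int delivered_locally;     /* these flags only record what happened */
    int forwarded;
    int sent;
};

/* Stand-ins for ip_local_deliver(), ip_forward() and ip_output(). */
static int fake_local_deliver(struct mini_skb *skb) { skb->delivered_locally = 1; return 0; }
static int fake_forward(struct mini_skb *skb)       { skb->forwarded = 1; return 0; }
static int fake_output(struct mini_skb *skb)        { skb->sent = 1; return 0; }

/* In the spirit of dst_input()/dst_output(): call through the route. */
static int mini_dst_input(struct mini_skb *skb)  { return skb->dst->input(skb); }
static int mini_dst_output(struct mini_skb *skb) { return skb->dst->output(skb); }
```

The design choice this illustrates: the routing decision is made once, when the route is attached to the packet; after that, local delivery and forwarding are just different targets of the same indirect call.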
SK_BUFF contd.
In the usual case, there is only one dst_entry for every skb.
When using IPSec, there is a linked list of dst_entries and only the last one is for routing; all other dst_entries are for IPSec transformers; these other dst_entries have the DST_NOHASH flag set.
tstamp (of type ktime_t): the time stamp of receiving the packet.
net_enable_timestamp() must be called in order to get values.
net_device
net_device represents a network interface card.
There are cases when we work with virtual devices.
For example, bonding (setting the same IP for two or more NICs, for load balancing and for high availability).
This is often implemented using the private data of the device (the void *priv member of net_device).
In OpenSolaris there is a special pseudo driver called vnic which enables bandwidth allocation (project CrossBow).
Important members:
net_device contd
unsigned int mtu: Maximum Transmission Unit: the maximum size of frame the device can handle.
Each protocol has an mtu of its own; the default is 1500 for Ethernet.
You can change the mtu with ifconfig; for example, like this:
ifconfig eth0 mtu 1400
You cannot, of course, change it to values higher than 1500 on a 10Mb/s network:
ifconfig eth0 mtu 1501 will give:
SIOCSIFMTU: Invalid argument
net_device contd
unsigned int flags (which you see or set using the ifconfig utility): for example, RUNNING or NOARP.
unsigned char dev_addr[MAX_ADDR_LEN]: the MAC address of the device (6 bytes).
int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);
a pointer to the device transmit method.
int promiscuity; (a counter of the times a NIC is told to work in promiscuous mode; used to enable more than one sniffing client.)
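The promiscuity counter can be modeled like this. It is a simplified userspace sketch of what dev_set_promiscuity(dev, inc) does in the kernel: each user adds +1 when it wants promiscuous mode and -1 when it is done, and the device keeps the IFF_PROMISC flag while the counter is non-zero. struct mini_netdev and mini_set_promiscuity() are hypothetical names; the real function also informs the driver and logs the change.

```c
#include <assert.h>

#define MINI_IFF_PROMISC 0x100   /* same value as IFF_PROMISC in <net/if.h> */

/* Hypothetical, trimmed-down net_device. */
struct mini_netdev {
    unsigned int flags;
    int promiscuity;             /* number of outstanding promiscuous requests */
};

/* Simplified model of dev_set_promiscuity(dev, inc): the device stays
 * promiscuous while at least one user (e.g. a sniffer) still wants it. */
static void mini_set_promiscuity(struct mini_netdev *dev, int inc)
{
    dev->promiscuity += inc;
    assert(dev->promiscuity >= 0);   /* more releases than requests is a bug */
    if (dev->promiscuity > 0)
        dev->flags |= MINI_IFF_PROMISC;
    else
        dev->flags &= ~(unsigned int)MINI_IFF_PROMISC;
}
```

With two sniffers running, stopping one of them leaves the counter at 1, so the NIC stays promiscuous for the other.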
net_device contd
You are likely to encounter macros starting with IN_DEV like: IN_DEV_FORWARD() or IN_DEV_RX_REDIRECTS(). How are they related to net_device? How are these macros implemented?
void *ip_ptr: IPv4 specific data. This pointer is assigned to a pointer to in_device in inetdev_init() (net/ipv4/devinet.c).
net_device Contd.
struct in_device has a member named cnf (an instance of ipv4_devconf). Setting /proc/sys/net/ipv4/conf/all/forwarding eventually sets the forwarding member of the in_device's cnf to 1.
The same is true of accept_redirects and send_redirects; both are also members of cnf (ipv4_devconf).
In most distros, /proc/sys/net/ipv4/conf/all/forwarding = 0.
But probably this is not so on your ADSL router.
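One way to picture how the IN_DEV macros relate to net_device: the IPv4 data hangs off dev->ip_ptr as a struct in_device, and the macros read fields of its cnf member. The sketch below is a simplified, hypothetical model (all mini_* names are made up); in kernels of this era IN_DEV_FORWARD() essentially reads (in_dev)->cnf.forwarding, while IN_DEV_RX_REDIRECTS() also combines the per-device value with the global "all" setting, which is not modeled here.

```c
#include <assert.h>

/* Hypothetical, trimmed-down versions of ipv4_devconf, in_device and
 * net_device, keeping only the members discussed on the slides. */
struct mini_devconf {
    int forwarding;
    int accept_redirects;
    int send_redirects;
};

struct mini_in_device {
    struct mini_devconf cnf;   /* per-device conf values */
};

struct mini_netdev {
    void *ip_ptr;              /* IPv4-specific data (set by inetdev_init()) */
};

/* Roughly what __in_dev_get_rcu() does, minus RCU: get the in_device. */
static struct mini_in_device *mini_in_dev_get(const struct mini_netdev *dev)
{
    return (struct mini_in_device *)dev->ip_ptr;
}

/* IN_DEV_FORWARD()-style macro: just reads a field of cnf. */
#define MINI_IN_DEV_FORWARD(in_dev) ((in_dev)->cnf.forwarding)
```

Writing 1 to the forwarding entry under /proc corresponds, in this model, to setting in.cnf.forwarding = 1 for the device.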
network interface drivers
Most of the NICs are PCI devices; there are also some USB network devices.
The drivers for network PCI devices use the generic PCI calls, like pci_register_driver() and pci_enable_device().
For more info on NIC drivers see the article "Writing Network Device Driver for Linux" (link no. 9 in links) and chapter 17 in ldd3.
There are two modes in which a NIC can receive a packet.
The traditional way is interrupt driven: each received packet is an asynchronous event which causes an interrupt.
NAPI
NAPI (new API).
The NIC works in polling mode.
In order for the NIC to work in polling mode, its driver should be built with the proper flag.
Most of the new drivers support this feature.
When working with NAPI and when there is a very high load, packets are lost; but this occurs before they are fed into the network stack. (In the non-NAPI driver they pass into the stack.)
In Solaris, polling is built into the kernel (no need to build
User Space Tools
iputils (including ping, arping, and more)
net-tools (ifconfig, netstat, route, arp and more)
IPROUTE2 (ip command with many options)
Uses the rtnetlink API.
Has much wider functionality; for example, you can create tunnels with the ip command.
Note: there is no need for the -n flag when using IPROUTE2 (because it does not work with DNS).
Routing Subsystem
The routing table and the routing cache enable us to find the net device and the address of the host to which a packet will be sent.
Reading entries in the routing table is done by calling fib_lookup(const struct flowi *flp, struct fib_result *res).
FIB is the Forwarding Information Base.
There are two routing tables by default (non Policy Routing case):
local FIB table (ip_fib_local_table; ID 255).
main FIB table (ip_fib_main_table; ID 254).
See: include/net/ip_fib.h.
Routing Subsystem contd.
Routes can be added into the main routing table in one of 3 ways:
By sysadmin command (route add / ip route).
By routing daemons.
As a result of ICMP (REDIRECT).
A routing table is implemented by struct fib_table.
Routing Tables
fib_lookup() first searches the local FIB table (ip_fib_local_table).
In case it does not find an entry, it looks in the main FIB table (ip_fib_main_table).
Why is it in this order?
There is one routing cache, regardless of how many routing tables there are.
You can see the routing cache by running route -C.
Alternatively, you can see it by: cat /proc/net/rt_cache.
Drawback: this way, the addresses are in hex format.
Routing Cache
The routing cache is built of rtable elements:
struct rtable (see: include/net/route.h)
{
    union
    {
        struct dst_entry dst;
    } u;
    ...
};
Routing Cache contd
The dst_entry is the protocol-independent part.
Thus, for example, we have a dst_entry member (also called dst) in rt6_info in IPv6 (include/net/ip6_fib.h).
The key for a lookup operation in the routing cache is an IP address (whereas in the routing table the key is a subnet).
Inserting elements into the routing cache is done by rt_intern_hash().
There is an alternate mechanism for the routing table (FIB) lookup, called fib_trie, which is inside the kernel tree (net/ipv4/fib_trie.c).
Routing Cache contd
It is based on extending the lookup key.
You should set: CONFIG_IP_FIB_TRIE (=y)
(instead of CONFIG_IP_FIB_HASH)
By Robert Olsson et al (see links).
Creating a Routing Cache Entry
Allocation of an rtable instance (rth) is done by: dst_alloc().
Setting the input and output methods of dst:
(rth->u.dst.input and rth->u.dst.output)
Setting the flowi member of the rtable (rth->fl)
dst_alloc() in fact creates and returns a pointer to dst_entry and we cast it to rtable (net/core/dst.c).
Next time there is a lookup in the cache, for example, in ip_route_input(), we will compare against rth->fl.
Routing Cache Contd.
A garbage collection call deletes eligible entries from the routing cache.
Which entries are not eligible?
Policy Routing (multiple tables)
Generic routing uses destination-address-based decisions.
There are cases when the destination address is not the sole parameter to decide which route to give; Policy Routing comes to enable this.
Policy Routing (multiple tables) contd.
Adding a routing table: by adding a line to /etc/iproute2/rt_tables.
For example: add the line "252 my_rt_table".
There can be up to 255 routing tables.
Policy routing should be enabled when building the kernel (CONFIG_IP_MULTIPLE_TABLES should be set).
Example of adding a route in this table:
> ip route add default via 192.168.0.1 table my_rt_table
Show the table by:
ip route show table my_rt_table
Policy Routing (multiple tables) contd.
You can add a rule to the routing policy database (RPDB) by ip rule add ...
The rule can be based on input interface, TOS, fwmark (from netfilter).
ip rule list: shows all rules.
Policy Routing: add/delete a rule example
ip rule add tos 0x04 table 252
This will cause packets with tos=0x04 (in the iphdr) to be routed by looking into the table we added (252).
So the default gw for these types of packets will be 192.168.0.1.
ip rule show will give:
32765: from all tos reliability lookup my_rt_table
...
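A hypothetical sketch of the rule walk that ip rule sets up: rules are checked in ascending priority order and the first rule whose selector matches names the table to consult. struct rule and rpdb_select_table() are illustrative only; matching is reduced to the TOS selector, and the default priority-0 rule for the local table is left out. In the kernel, if the chosen table has no matching route, the search continues with the next rule, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

#define MINI_RT_TABLE_MAIN 254   /* the main table's ID, as on the slides */

/* Hypothetical, much smaller cousin of a policy routing rule. */
struct rule {
    unsigned int pref;      /* priority: lower values are checked first */
    unsigned char tos;      /* TOS selector; 0 matches any TOS */
    int table;              /* routing table to consult on a match */
};

/* Walk rules (assumed sorted by ascending pref) and return the table of
 * the first rule whose selector matches, or -1 if none matches. */
static int rpdb_select_table(const struct rule *rules, size_t n,
                             unsigned char pkt_tos)
{
    for (size_t i = 0; i < n; i++)
        if (rules[i].tos == 0 || rules[i].tos == pkt_tos)
            return rules[i].table;
    return -1;
}
```

With the rule from this slide (priority 32765, tos 0x04, table 252) placed before a catch-all main-table rule (32766), a packet with tos 0x04 selects table 252 and every other packet falls through to table 254.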
Policy Routing: add/delete a rule example
Delete a rule: ip rule del tos 0x04 table 252
Routing Lookup
ip_route_input() (in net/ipv4/route.c): cache lookup.
    Hit: the cached route is used.
    Miss: ip_route_input_slow() (in net/ipv4/route.c):
        fib_lookup() in ip_fib_local_table.
            Hit: deliver the packet by ip_local_deliver().
            Miss: fib_lookup() in ip_fib_main_table; drop the packet or call ip_forward(), according to the result.
Routing Table Diagram
[Figure: struct fib_table (with the methods tb_lookup(), tb_insert() and tb_delete()) points to 33 struct fn_zone entries, one per prefix length (0 to 32). Each fn_zone has fz_hash, an array of fz_divisor hlist_head buckets holding struct fib_node elements (fn_key, fn_alias). The fn_alias list of a fib_node holds struct fib_alias entries, whose fa_info points to a struct fib_info with its fib_nh next hops.]
Routing Tables
Breaking the fib_table into multiple data structures gives flexibility and enables fine-grained and a high level of sharing.
Suppose that we have 10 routes to 10 different networks with the same next hop gw.
We can have one fib_info which will be shared by 10 fib_aliases.
fz_divisor is the number of buckets.
Routing Tables contd
Each fib_node element represents a unique subnet.
The fn_key member of fib_node is the subnet (32 bit).
Routing Tables contd
Suppose that a device goes down or is enabled.
We need to disable/enable all routes which use this device.
But how can we know which routes use this device?
In order to know it efficiently, there is the fib_info_devhash table.
This table is indexed by the device identifier.
See fib_sync_down() and fib_sync_up() in net/ipv4/fib_semantics.c.
Routing Table lookup algorithm
LPM (Longest Prefix Match) is the lookup algorithm.
The route with the longest netmask is the one chosen.
Netmask 0, which is the shortest netmask, is for the default gateway.
What happens when there are multiple entries with netmask=0?
fib_lookup() returns the first entry it finds in the fib table where the netmask length is 0.
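Longest prefix match can be sketched along the lines of the hash-based FIB in the diagram: one zone per prefix length (33 zones for /0 to /32), each zone a small hash table keyed by the masked destination, searched from the longest prefix to the shortest. This is a simplified userspace model with hypothetical names (route_add(), route_lookup(), and DIVISOR standing in for fz_divisor); the kernel only walks the zones that are non-empty and uses its own hash function.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define NZONES 33      /* one zone per prefix length 0..32, like fn_zone */
#define DIVISOR 16     /* buckets per zone; plays the role of fz_divisor */
#define MAX_NODES 64

struct node {              /* stands in for fib_node */
    uint32_t key;          /* subnet in host byte order, like fn_key */
    int nexthop_id;        /* stands in for the fib_info it points to */
    struct node *next;     /* bucket chain */
};

struct zone {
    struct node *hash[DIVISOR];
};

static struct zone zones[NZONES];
static struct node pool[MAX_NODES];
static int pool_used;

static uint32_t mask_of(int plen) { return plen ? 0xFFFFFFFFu << (32 - plen) : 0; }

static unsigned bucket_of(uint32_t key)
{
    return (key ^ (key >> 8) ^ (key >> 16) ^ (key >> 24)) % DIVISOR;
}

/* Insert a route for prefix/plen; returns 0 on success, -1 on error. */
static int route_add(uint32_t prefix, int plen, int nexthop_id)
{
    if (plen < 0 || plen > 32 || pool_used == MAX_NODES)
        return -1;
    struct node *n = &pool[pool_used++];
    n->key = prefix & mask_of(plen);
    n->nexthop_id = nexthop_id;
    unsigned b = bucket_of(n->key);
    n->next = zones[plen].hash[b];
    zones[plen].hash[b] = n;
    return 0;
}

/* Longest prefix match: walk zones from /32 down to /0 and return the
 * first hit, so the most specific matching subnet wins. -1 = no route. */
static int route_lookup(uint32_t daddr)
{
    for (int plen = 32; plen >= 0; plen--) {
        uint32_t key = daddr & mask_of(plen);
        for (struct node *n = zones[plen].hash[bucket_of(key)]; n; n = n->next)
            if (n->key == key)
                return n->nexthop_id;
    }
    return -1;
}
```

With 192.168.0.0/16, 192.168.1.0/24 and a default route installed, 192.168.1.5 matches the /24, 192.168.2.5 the /16, and 10.0.0.1 only the /0 entry.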
Routing Table lookup contd.
It may be that this is not the best choice of default gateway.
So in case the netmask is 0 (the prefixlen of the fib_result returned from fib_lookup() is 0) we call fib_select_default().
fib_select_default() will select the route with the lowest priority (metric) (by comparing the fib_priority values of all default gateways).
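The selection step described above can be sketched as a linear scan for the smallest metric. struct default_route and select_default() are hypothetical; this follows the slide's description only, while the kernel's fib_select_default() also takes other factors into account (such as whether a gateway appears to be dead).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: one default (/0) route. */
struct default_route {
    unsigned int gw;        /* gateway address, host byte order */
    unsigned int priority;  /* the metric, like fib_priority */
};

/* Return the index of the default route with the lowest metric, or -1
 * if there are none. On ties the earliest entry is kept. */
static int select_default(const struct default_route *r, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (best < 0 || r[i].priority < r[best].priority)
            best = (int)i;
    return best;
}
```

Keeping the earliest entry on ties means that, among equally good gateways, the one fib_lookup() would have found first is preferred in this model.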
Receiving a packet
When working in the interrupt-driven model, the NIC registers an interrupt handler with the IRQ with which the device works by calling request_irq().
This interrupt handler will be called when a frame is received.
The same interrupt handler will be called when transmission of a frame is finished and under other conditions (this depends on the NIC; sometimes, the interrupt handler will be called when there is some error).
Receiving a packet contd
Typically in the handler, we allocate an sk_buff by calling dev_alloc_skb(); also eth_type_trans() is called; among other things, it advances the data pointer of the sk_buff to point to the IP header; this is done by calling skb_pull(skb, ETH_HLEN).
See: net/ethernet/eth.c
ETH_HLEN is 14, the size of the Ethernet header.
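A userspace sketch of the header skip that eth_type_trans() performs. The Ethernet header is destination MAC (6 bytes), source MAC (6 bytes) and EtherType (2 bytes, big endian), so the type lives at offset 12 and the L3 header starts at offset ETH_HLEN = 14. struct frame and mini_eth_type_trans() are hypothetical; the real function also sets skb->pkt_type (broadcast, multicast, other host) and handles 802.3 length fields, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MINI_ETH_ALEN  6        /* MAC address length */
#define MINI_ETH_HLEN  14       /* same value as ETH_HLEN */
#define MINI_ETH_P_IP  0x0800
#define MINI_ETH_P_ARP 0x0806

/* Hypothetical receive buffer. */
struct frame {
    const uint8_t *data;        /* current header, like skb->data */
    size_t len;
};

/* Loosely modeled on eth_type_trans(): read the EtherType (bytes 12-13,
 * big endian) and pull the 14-byte Ethernet header so that data points
 * at the L3 header. Returns the EtherType, or -1 if the frame is too short. */
static int mini_eth_type_trans(struct frame *f)
{
    if (f->len < MINI_ETH_HLEN)
        return -1;
    int proto = (f->data[2 * MINI_ETH_ALEN] << 8) | f->data[2 * MINI_ETH_ALEN + 1];
    f->data += MINI_ETH_HLEN;   /* the skb_pull(skb, ETH_HLEN) step */
    f->len -= MINI_ETH_HLEN;
    return proto;
}
```

The returned value is what decides which protocol handler runs next: 0x0800 leads to ip_rcv(), 0x0806 to arp_rcv().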
Receiving a packet contd
The handler for receiving an IPv4 packet is ip_rcv() (net/ipv4/ip_input.c).
Handlers for the protocols are registered at the init phase.
Likewise, arp_rcv() is the handler for ARP packets.
First, ip_rcv() performs some sanity checks. For example:
if (iph->ihl < 5 || iph->version != 4)
        goto inhdr_error;
iph is the ip header; iph->ihl is the ip header length (4 bits), counted in 32-bit words.
The ip header must be at least 20 bytes.
It can be up to 60 bytes (when we use IP options).
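The checks above can be reproduced on raw bytes, which sidesteps the bit-field layout of struct iphdr: the version is the high nibble of the first byte and ihl the low nibble, counted in 32-bit words, so 5 words give the 20-byte minimum and 15 words the 60-byte maximum. ip_header_len() is a hypothetical helper; the real ip_rcv() goes on to verify the header checksum and the total length.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the first checks ip_rcv() makes, on raw bytes. Returns the
 * header length in bytes (20..60), or -1 if the header is malformed. */
static int ip_header_len(const uint8_t *pkt, size_t len)
{
    if (len < 20)
        return -1;                     /* shorter than a minimal header */
    unsigned version = pkt[0] >> 4;    /* high nibble */
    unsigned ihl = pkt[0] & 0x0f;      /* low nibble, in 32-bit words */
    if (ihl < 5 || version != 4)
        return -1;                     /* same test as the inhdr_error path */
    if (len < ihl * 4u)
        return -1;                     /* options run past the data we have */
    return (int)(ihl * 4);
}
```

A first byte of 0x45 is the common case: version 4 and a 20-byte header without options.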
Receiving a packet contd
Then it calls ip_rcv_finish(), by:
NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
This division of methods into two stages (where the second has the same name with the suffix finish or slow) is typical for networking kernel code.
In many cases the second method has a slow suffix instead of finish; this usually happens when the first method looks in some cache and the second method performs a lookup in a table, which is slower.
Receiving a packet contd
ip_rcv_finish() implementation:
if (skb->dst == NULL) {
        int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
                                 skb->dev);
        ...
}
...
return dst_input(skb);
Receiving a packet contd
ip_route_input():
First performs a lookup in the routing cache to see if there is a match. If there is no match (cache miss), it calls ip_route_input_slow() to perform a lookup in the routing table (this lookup is done by calling fib_lookup()).
fib_lookup(const struct flowi *flp, struct fib_result *res)
The results are kept in fib_result.
ip_route_input() returns 0 upon a successful lookup (also when there is a cache miss but a successful lookup in the routing table).
Receiving a packet contd
According to the results of fib_lookup(), we know if the frame is for local delivery, for forwarding, or to be dropped.
If the frame is for local delivery, we will set the input() function pointer of the route to ip_local_deliver():
rth->u.dst.input = ip_local_deliver;
If the frame is to be forwarded, we will set the input() function pointer to ip_forward():
rth->u.dst.input = ip_forward;
Local Delivery
Prototype:
int ip_local_deliver(struct sk_buff *skb) (net/ipv4/ip_input.c).
Calls NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb, skb->dev, NULL, ip_local_deliver_finish);
Delivers the packet to the higher protocol layers according to its type.
Forwarding
Prototype:
int ip_forward(struct sk_buff *skb)
(net/ipv4/ip_forward.c)
Checks the ttl in the ip header: if the ttl is <= 1, the method sends an ICMP message (ICMP_TIME_EXCEEDED) and drops the packet.
Otherwise, it decreases the ttl in the ip header.
Calls NF_HOOK(PF_INET, NF_IP_FORWARD, skb, skb->dev, rt->u.dst.dev, ip_forward_finish);
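A userspace sketch of the TTL step, assuming a 20-byte header in a byte array. The TTL is the high byte of the 16-bit word at offset 8, so decrementing it lowers that word by 0x0100 and the one's complement checksum (RFC 1071) goes up by 0x0100; this mirrors the incremental update done by the kernel's ip_decrease_ttl() instead of recomputing the whole checksum. ip_checksum() and forward_decrease_ttl() are hypothetical names, and sending ICMP_TIME_EXCEEDED is left to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* RFC 1071 Internet checksum over n bytes (n even). Over a header that
 * includes a correct checksum field, the result is 0. */
static uint16_t ip_checksum(const uint8_t *h, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        sum += (uint32_t)(h[i] << 8 | h[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);   /* fold the carries */
    return (uint16_t)~sum;
}

/* Refuse to forward if ttl <= 1 (the caller would then send
 * ICMP_TIME_EXCEEDED); otherwise decrement ttl and patch the checksum
 * incrementally. Returns 0 when the packet may be forwarded, -1 if not. */
static int forward_decrease_ttl(uint8_t *h)
{
    if (h[8] <= 1)
        return -1;
    h[8]--;                                   /* ttl is byte 8 */
    uint32_t check = (uint32_t)(h[10] << 8 | h[11]);
    check += 0x0100;                          /* word at offset 8 fell by 0x0100 */
    check = (check + (check >= 0xffff)) & 0xffff;  /* end-around carry; 0xffff -> 0 */
    h[10] = (uint8_t)(check >> 8);
    h[11] = (uint8_t)check;
    return 0;
}
```

Running a valid header through ip_checksum() gives 0 both before and after forward_decrease_ttl(), which is a quick way to confirm the incremental update.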
Forwarding Contd
ip_forward_finish(): sends the packet out by calling dst_output(skb).
dst_output(skb) is just a wrapper, which calls skb->dst->output(skb) (see include/net/dst.h).
Sending a Packet
The route lookup for sending a packet is done by ip_route_output_key().
We need to perform a routing lookup also in the case of transmission.
In case of a cache miss, we call ip_route_output_slow(), which looks in the routing table (by calling fib_lookup(), as is also done in ip_route_input_slow()).
If the packet is for a remote host, we set dst->output to ip_output().
Sending a Packet contd
ip_output() will call ip_finish_output().
This is the NF_IP_POST_ROUTING point.
ip_finish_output() will eventually send the packet from a neighbor by:
dst->neighbour->output(skb)
arp_bind_neighbour() sees to it that the L2 address of the next hop will be known (net/ipv4/arp.c).
Sending a Packet Contd.
If the packet is for the local machine:
dst->output = ip_output
dst->input = ip_local_deliver
ip_output() will send the packet on the loopback device.
Then we will go into ip_rcv() and ip_rcv_finish(), but this time dst is NOT null; so we will end in ip_local_deliver().
See: net/ipv4/route.c
Multipath routing
This feature enables the administrator to set multiple next hops for a destination.
To enable multipath routing, CONFIG_IP_ROUTE_MULTIPATH should be set when building the kernel.
There was also an option for multipath caching (by setting CONFIG_IP_ROUTE_MULTIPATH_CACHED).
It was experimental and was removed in 2.6.23. See links (6).
Netfilter
Netfilter is the kernel layer that supports applying iptables rules.
It enables:
Filtering
Changing packets (masquerading)
Connection Tracking
Netfilter rule example
Short example:
Applying the following iptables rule:
iptables -A INPUT -p udp --dport 9999 -j DROP
This is an NF_IP_LOCAL_IN rule;
The packet will go to:
ip_rcv()
and then: ip_rcv_finish()
And then ip_local_deliver()
Netfilter rule example (contd)
but it will NOT proceed to ip_local_deliver_finish() as in the usual case, without this rule.
As a result of applying this rule it reaches nf_hook_slow() with verdict == NF_DROP (which calls kfree_skb() to free the packet).
See net/netfilter/core.c.
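The verdict handling behind NF_HOOK() can be sketched as follows: every hook registered at a hook point sees the packet in turn; a drop verdict frees the packet and the "finish" function (okfn) is never called, while acceptance by all hooks leads to okfn, such as ip_local_deliver_finish(). This is a simplified, hypothetical model (mini_nf_hook(), hooks[] and drop_udp_9999() are illustrative names); the real nf_hook_slow() also handles hook priorities and verdicts such as NF_QUEUE and NF_STOLEN.

```c
#include <assert.h>
#include <stddef.h>

/* NF_DROP is 0 and NF_ACCEPT is 1 in the kernel as well. */
enum verdict { V_DROP = 0, V_ACCEPT = 1 };

/* Hypothetical packet: just enough state for the example. */
struct pkt { int dport; int freed; int delivered; };

typedef enum verdict (*hook_fn)(struct pkt *p);
typedef int (*okfn_t)(struct pkt *p);

#define MAX_HOOKS 8
static hook_fn hooks[MAX_HOOKS];   /* hooks registered at one hook point */
static int nhooks;

/* Simplified NF_HOOK(): run the hooks in order; on a drop verdict, free
 * the packet and stop; if every hook accepts, call the finish function. */
static int mini_nf_hook(struct pkt *p, okfn_t okfn)
{
    for (int i = 0; i < nhooks; i++) {
        if (hooks[i](p) == V_DROP) {
            p->freed = 1;          /* stands in for kfree_skb() */
            return -1;             /* the kernel returns -EPERM here */
        }
    }
    return okfn(p);
}

/* Stand-in for ip_local_deliver_finish(). */
static int local_deliver_finish(struct pkt *p) { p->delivered = 1; return 0; }

/* Hypothetical rule: like "iptables -A INPUT -p udp --dport 9999 -j DROP". */
static enum verdict drop_udp_9999(struct pkt *p)
{
    return p->dport == 9999 ? V_DROP : V_ACCEPT;
}
```

This also shows why the two-stage naming (ip_local_deliver / ip_local_deliver_finish) exists: the second stage is passed to the hook as okfn, so filtering can stop the packet between the two.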
ICMP redirect message
The ICMP protocol is used to notify about problems.
A REDIRECT message is sent in case the route is suboptimal (inefficient).
There are in fact 4 types of REDIRECT.
Only one is used:
Redirect Host (ICMP_REDIR_HOST)
See RFC 1812 (Requirements for IP Version 4 Routers).
ICMP redirect message contd.
To support sending ICMP redirects, the machine should be configured to send redirect messages.
/proc/sys/net/ipv4/conf/all/send_redirects should be 1.
For the other side to accept redirects, we should set /proc/sys/net/ipv4/conf/all/accept_redirects to 1.
ICMP redirect message contd.
Example:
Add a suboptimal route on 192.168.0.31:
route add -net 192.168.0.10 netmask 255.255.255.255 gw 192.168.0.121
Running route now on 192.168.0.31 will show a new entry:
Destination     Gateway         Genmask          Flags Metric Ref Use Iface
192.168.0.10    192.168.0.121   255.255.255.255  UGH   0      0   0   eth0
ICMP redirect message contd.
Send packets from 192.168.0.31 to 192.168.0.10:
ping 192.168.0.10 (from 192.168.0.31)
We will see (on 192.168.0.31):
From 192.168.0.121: icmp_seq=2 Redirect Host(New nexthop: 192.168.0.10)
Now, running on 192.168.0.121:
route -Cn | grep .10
shows that there is a new entry in the routing cache:
ICMP redirect message contd.
192.168.0.31    192.168.0.10    192.168.0.10    ri    0      0     34   eth0
The r in the flags column means: RTCF_DOREDIRECT.
The 192.168.0.121 machine had sent a redirect by calling ip_rt_send_redirect() from ip_forward() (net/ipv4/ip_forward.c).
ICMP redirect message contd.
And on 192.168.0.31, running route -C | grep .10 now shows a new entry in the routing cache (in case accept_redirects=1):
192.168.0.31    192.168.0.10    192.168.0.10          0      0     1    eth0
In case accept_redirects=0 (on 192.168.0.31), we will see:
192.168.0.31    192.168.0.10    192.168.0.121         0      0     0    eth0
which means that the gw is still 192.168.0.121 (which is the
ICMP redirect message contd.
Adding an entry to the routing cache as a result of getting an ICMP REDIRECT is done in ip_rt_redirect(), net/ipv4/route.c.
The entry in the routing table is not deleted.
Neighboring Subsystem
Best-known protocol: ARP (in IPv6: ND, neighbour discovery)
ARP table.
The Ethernet header is 14 bytes long:
Destination MAC address (6 bytes).
Source MAC address (6 bytes).
Type (2 bytes).
0x0800 is the type for an IP packet (ETH_P_IP)
0x0806 is the type for an ARP packet (ETH_P_ARP)
see: include/linux/if_ether.h
Neighboring Subsystem contd
When there is no entry in the ARP cache for the destination IP address of a packet, a broadcast is sent (ARP request, ARPOP_REQUEST: who has IP address x.y.z...). This is done by a method called arp_solicit() (net/ipv4/arp.c).
You can see the contents of the ARP table by running cat /proc/net/arp or by running arp from a command line.
You can delete and add entries to the ARP table; see man arp.
Bridging Subsystem
You can define a bridge and add NICs to it ("enslaving ports") using brctl (from bridge-utils).
You can have up to 1024 ports for every bridge device (BR_MAX_PORTS).
Example:
brctl addbr mybr
brctl addif mybr eth0
brctl show
Bridging Subsystem contd.
When a NIC is configured as a bridge port, the br_port member of net_device is initialized (br_port is an instance of struct net_bridge_port).
When we receive a frame, netif_receive_skb() calls handle_bridge().
Bridging Subsystem contd.
The bridging forwarding database is searched for the destination MAC address.
In case of a hit, the frame is sent to the bridge port with br_forward() (net/bridge/br_forward.c).
If there is a miss, the frame is flooded on all bridge ports using br_flood() (net/bridge/br_forward.c).
Note: this is not a broadcast!
The ebtables mechanism is the L2 parallel of L3 Netfilter.
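The forwarding-database decision described above (br_forward() on a hit, br_flood() on a miss) can be sketched like this: learn source MAC addresses per port, forward to the single known port on a hit, and flood to every other port on a miss. fdb_learn(), fdb_find() and bridge_decide() are hypothetical names and the table is a tiny array; the kernel uses a hash table with aging. Like the kernel, the sketch never sends a frame back out of the port it arrived on.

```c
#include <assert.h>
#include <string.h>

#define NPORTS 4
#define FDB_SIZE 32

/* Hypothetical forwarding-database entry. */
struct fdb_entry {
    unsigned char mac[6];
    int port;
    int used;
};

static struct fdb_entry fdb[FDB_SIZE];

/* Learning: remember which port a source MAC address was seen on. */
static void fdb_learn(const unsigned char *src, int port)
{
    int free_slot = -1;
    for (int i = 0; i < FDB_SIZE; i++) {
        if (fdb[i].used && memcmp(fdb[i].mac, src, 6) == 0) {
            fdb[i].port = port;        /* refresh, or the station moved */
            return;
        }
        if (!fdb[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {              /* a full table simply stops learning */
        memcpy(fdb[free_slot].mac, src, 6);
        fdb[free_slot].port = port;
        fdb[free_slot].used = 1;
    }
}

/* Return the port a MAC address was learned on, or -1 if unknown. */
static int fdb_find(const unsigned char *dst)
{
    for (int i = 0; i < FDB_SIZE; i++)
        if (fdb[i].used && memcmp(fdb[i].mac, dst, 6) == 0)
            return fdb[i].port;
    return -1;
}

/* Decide where a frame received on in_port goes: fills out[] with the
 * egress ports and returns how many. Hit: one port; miss: flood. */
static int bridge_decide(const unsigned char *dst, int in_port, int out[NPORTS])
{
    int n = 0;
    int port = fdb_find(dst);
    if (port >= 0) {
        if (port != in_port)           /* same segment: nothing to do */
            out[n++] = port;
        return n;
    }
    for (int p = 0; p < NPORTS; p++)   /* unknown destination: flood */
        if (p != in_port)
            out[n++] = p;
    return n;
}
```

Flooding only happens while the destination is unknown; once that station sends a frame and is learned, traffic to it goes to a single port.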
Bridging Subsystem contd
Ebtables enable us to filter and mangle packets at the link layer (L2).
IPSec
Works at the network IP layer (L3)
Used in many forms of secured networks like VPNs.
Mandatory in IPv6 (not in IPv4).
Implemented in many operating systems: Linux, Solaris, Windows, and more.
RFC 2401
In the 2.6 kernel: implemented by Dave Miller and Alexey Kuznetsov.
Transformation bundles.
Chain of dst entries; only the last one is for routing.
IPSec cont.
User space tools: http://ipsec-tools.sf.net
Building a VPN: http://www.openswan.org/ (Open Source).
There are also non-IPSec solutions for VPN;
example: pptp
struct xfrm_policy has the following member:
struct dst_entry *bundles;
__xfrm4_bundle_create() creates dst_entries (with the DST_NOHASH flag); see: net/ipv4/xfrm4_policy.c
Transport Mode and Tunnel Mode.
IPSec contd.
Show the security policies:
ip xfrm policy show
Create RSA keys:
ipsec rsasigkey --verbose 2048 > keys.txt
ipsec showhostkey --left > left.publickey
ipsec showhostkey --right > right.publickey
IPSec contd.
Example: Host to Host VPN (using openswan)
in /etc/ipsec.conf:
conn linux-to-linux
        left=192.168.0.189
        leftnexthop=%direct
        leftrsasigkey=0sAQPPQ...
        right=192.168.0.45
        rightnexthop=%direct
        rightrsasigkey=0sAQNwb...
        type=tunnel
        auto=start
IPSec contd.
service ipsec start (to start the service)
ipsec verify: check your system to see if IPsec got installed and started correctly.
ipsec auto --status
If you see "IPsec SA established", this implies success.
Look for errors in /var/log/secure (Fedora Core) or in the kernel syslog.
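Tips for hacking
Documentation/networking/ip-sysctl.txt: networking kernel tunables
Example of reading a hex address:
iph->daddr == 0x0A00A8C0
means checking if the address is 192.168.0.10 (C0=192, A8=168, 00=0, 0A=10); the bytes look reversed because the address is stored in network byte order and is read as a number on a little-endian machine such as x86.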
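A userspace sketch of reading such addresses. MY_NIPQUAD() follows the same idea as the kernel's old NIPQUAD() macro: it expands a network-order address into its four bytes in memory order, so the printed result does not depend on the host's endianness. MY_NIPQUAD and format_ip() are hypothetical names; inet_pton(), ntohl() and snprintf() are the standard POSIX/C calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* Same idea as the kernel's NIPQUAD(): the four bytes of a network-order
 * IPv4 address, in memory order, for a "%u.%u.%u.%u" format string. */
#define MY_NIPQUAD(addr) \
    ((const unsigned char *)&(addr))[0], \
    ((const unsigned char *)&(addr))[1], \
    ((const unsigned char *)&(addr))[2], \
    ((const unsigned char *)&(addr))[3]

/* Format a network-order address into buf (16 bytes are always enough). */
static void format_ip(uint32_t be_addr, char *buf, size_t size)
{
    snprintf(buf, size, "%u.%u.%u.%u", MY_NIPQUAD(be_addr));
}
```

On x86, the uint32_t that inet_pton() fills in for "192.168.0.10" compares equal to 0x0A00A8C0, exactly as in the iph->daddr example; ntohl() turns it into 0xC0A8000A, and format_ip() prints it back as 192.168.0.10.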
Tips for hacking Contd.
Disable ping reply:
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
Disable ARP: ip link set eth0 arp off (the NOARP flag will be set)
Also ifconfig eth0 -arp has the same effect.
How can you get the Path MTU to a destination (PMTU)?
Use tracepath (see man tracepath).
tracepath is from iputils.
Tips for hacking Contd.
Keep the iphdr struct handy (printout) (from linux/ip.h):
struct iphdr {
        __u8    ihl:4,          /* little-endian bit-field order; */
                version:4;      /* big-endian hosts declare version first */
        __u8    tos;
        __be16  tot_len;
        __be16  id;
        __be16  frag_off;
        __u8    ttl;
        __u8    protocol;
        __sum16 check;
        __be32  saddr;
        __be32  daddr;
        /* The options start here. */
};
Tips for hacking Contd.
NIPQUAD(): macro for printing IPv4 addresses (in dotted-quad form)
CONFIG_NET_DMA is for TCP/IP offload.
When you encounter xfrm / CONFIG_XFRM, this has to do with IPSec (transformers).
New and future trends
I/OAT.
Netchannels (Van Jacobson and Evgeniy Polyakov).
TCP Offloading.
RDMA.
Multiqueues: some new NICs, like e1000 and IPW2200, allow two or more hardware Tx queues. There are already patches to enable this.
New and future trends contd.
See: "Enabling Linux Network Support of Hardware Multiqueue Devices", OLS 2007.
Some more info in: Documentation/networking/multiqueue.txt in recent Linux kernels.
Devices with multiple TX/RX queues will have the NETIF_F_MULTI_QUEUE feature (include/linux/netdevice.h).
MQ NIC drivers will call alloc_etherdev_mq() or alloc_netdev_mq() instead of alloc_etherdev() or alloc_netdev().
Links and more info
1) Linux Network Stack Walkthrough (2.4.20):
http://gicl.cs.drexel.edu/people/sevy/network/Linux_network_stack_walkth
2) Understanding the Linux Kernel, Second Edition
By Daniel P. Bovet, Marco Cesati
Second Edition, December 2002
Chapter 18: Networking.
Understanding Linux Network Internals, Christian Benvenuti
O'Reilly, First Edition.
3) Linux Device Drivers, by Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman
Third Edition, February 2005.
Chapter 17, Network Drivers
4) Linux networking (a lot of docs about specific networking topics):
http://linux-net.osdl.org/index.php/Main_Page
5) netdev mailing list: http://www.spinics.net/lists/netdev/
6) Removal of multipath routing cache from kernel code:
http://lists.openwall.net/netdev/2007/03/12/76
http://lwn.net/Articles/241465/
7) Linux Advanced Routing & Traffic Control:
http://lartc.org/
8) ebtables: a filtering tool for bridging:
http://ebtables.sourceforge.net/
9) Writing Network Device Driver for Linux (article):
http://app.linux.org.mt/article/writingnetdrivers?locale=en
10) Netconf: a yearly networking conference; the first was in 2004.
http://vger.kernel.org/netconf2004.html
http://vger.kernel.org/netconf2005.html
http://vger.kernel.org/netconf2006.html
Next one: Linux Conf Australia, January 2008, Melbourne
David S. Miller, James Morris, Rusty Russell, Jamal Hadi Salim, Stephen Hemminger, Harald Welte, Hideaki YOSHIFUJI, Herbert Xu, Thomas Graf, Robert Olsson, Arnaldo Carvalho de Melo and others
11) Policy Routing With Linux, Online Book Edition
by Matthew G. Marsh (Sams).
http://www.policyrouting.org/PolicyRoutingBook/
12) TRASH: A dynamic LC-trie and hash data structure:
Robert Olsson, Stefan Nilsson, August 2006
http://www.csc.kth.se/~snilsson/public/papers/trash/trash.pdf
13) IPSec howto:
http://www.ipsec-howto.org/t1.html
14) Openswan: Building and Integrating Virtual Private Networks, by Paul Wouters, Ken Bantoft
http://www.packtpub.com/book/openswan/mid/061205jqdnh2by
publisher: Packt Publishing.